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The trouble with people is not that they don’t 
know but that they know so much that ain’t so. 


(Josh Billings 1874) 


| would have understood many things 
if no-one had explained them to me. 


(Stanistaw Jerzy Lec 1957) 


An expert is someone who has made all the 
mistakes which can be made in a narrow field. 


(Niels Bohr 1954) 
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lower and adjoint upper triangular system from the manuscript* of A.-L. Cholesky (1910), cf. §8.3 


+ Sur la résolution numérique des systémes d’équations linéaires », Fonds André-Louis Cholesky (1875-1918), 
courtesy of the Archives de l’Ecole Polytechnique, Palaiseau, France. 


Preface 


This book was developed from the lecture notes of an undergraduate level course 
for students of mathematics at the Technical University of Munich, consisting of 
two lectures per week. Its goal is to present and pass on a skill set of algorithmic 
and numerical thinking based on the fundamental problem set of numerical linear 
algebra. Limiting the scope to linear algebra creates a stronger thematic coherency 
in the course material than would be found in other introductory courses on 
numerical analysis. Beyond the didactic advantages, numerical linear algebra 
represents the basis for the field of numerical analysis, and should therefore be 
learned and mastered as early as possible. 

This exposition will emphasize the viability of partitioning vectors and matrices 
block-wise as compared to a more classic, component-by-component approach. In 
this way, we achieve not only a more lucid notation and shorter algorithms, but 
also a significant improvement in the execution time of calculations thanks to the 
ubiquity of modern vector processors and hierarchical memory architectures. 

The motto for this book will therefore be: 


A higher level of abstraction is actually an asset. 


During our discussion of error analysis we will be diving uncompromisingly deep 
into the relevant concepts in order to gain the greatest possible understanding 
of assessing numerical processes. More shallow approaches (e.g., the infamous 
“rules of thumb”) only result in unreliable, expensive and sometimes downright 
dangerous outcomes. 

The algorithms and accompanying numerical examples will be provided in the 
programming environment MATLAB, which is near to ubiquitous at universities 
around the world. Additionally, one can find the same examples programmed in 
the trailblazing numerical language Julia from MIT in Appendix B. My hope is 
that the following passages will not only pass on the intended knowledge, but 
also inspire further computer experiments. 

The accompanying e-book offers the ability to click through links in the passages. 
Links to references elsewhere in the book will appear blue, while external links 
will appear red. The latter will lead to explanations of terms and nomenclature 
which are assumed to be previous knowledge, or to supplementary material such 
as web-based computations, historical information and sources of the references. 


Munich, October 2017 Folkmar Bornemann 
bornemann@tum.de 
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viii Preface 


Student’s Laboratory 


In order to enhance the learning experience when reading this book, I recommend creating 
one’s own laboratory, and outfitting it with the following “tools”. 


Tool |: Programming Environment Due to the current prevalence of the numerical 
development environment MATLAB by MathWorks in both academic and industrial fields, 
we will be using it as our “go-to” scripting language in this book. Nevertheless, I would 
advise every reader to look into the programming language Julia from MIT. This inge- 
nious, forward-looking language is growing popularity, making it a language to learn. All 
programming examples in this book have been rewritten in Julia in Appendix B. 


Tool 2: Calculation Work Horse I will be devoting most of the pages of this book to 
the ideas and concepts of numerical linear algebra, and will avoid getting caught up with 
tedious calculations. Since many of these calculations are mechanical in nature, I encourage 
every reader to find a suitable “calculation work horse” to accomplish these tasks for them. 
Some convenient options include computer algebra systems such as Maple or Mathematica; 
the latter offers a free “one-liner” version online in the form of WolframAlpha. Several 
examples can be found as external links in §14. 


Tool 3: Textbook X In order to gain further perspective on the subject matter at hand, 
I recommend always having a second opinion, or “Textbook X”, within reach. Below I have 
listed a few excellent options: 
e Peter Deuflhard, Andreas Hohmann: Numerical Analysis in Modern Scientific Computing, 2nd ed., 
Springer-Verlag, New York, 2003. 
A refreshing book; according to the preface, my youthful enthusiasm helped form the presentation of error 
analysis. 
e Lloyd N. Trefethen, David Bau: Numerical Linear Algebra, Society of Industrial and Applied 
Mathematics, Philadelphia, 1997. 
A classic and an all time bestseller of the publisher, written in a lively voice. 
e James W. Demmel: Applied Numerical Linear Algebra, Society of Industrial and Applied Mathe- 
matics, Philadelphia, 1997. 
Deeper and more detailed than Trefethen—Bau, a classic as well. 


Tool 4: Reference Material For even greater immersion and as a starting point for 
further research, I strongly recommend the following works: 
e Gene H. Golub, Charles F. Van Loan: Matrix Computations, 4th ed., The Johns Hopkins University 
Press, Baltimore, 2013. 
The “Bible” on the topic. 
e Nicholas J. Higham: Accuracy and Stability of Numerical Algorithms, 2nd ed., Society of Industrial 
and Applied Mathematics, Philadelphia, 2002. 
The thorough modern standard reference for error analysis (without eigenvalue problems, though). 
e Roger A. Horn, Charles R. Johnson: Matrix Analysis, 2nd ed., Cambridge University Press, 
Cambridge, 2012. 


A classic on the topic of matrix theory; very thorough and detailed, a must-have reference. 
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| Computing with Matrices 


The purpose of computing is insight, not 
numbers. 


(Richard Hamming 1962) 


Applied mathematics is not engineering. 


(Paul Halmos 1981) 


| What is Numerical Analysis? 


1.1 Numerical analysis describes the construction and analysis of efficient discrete 
algorithms to solve continuous problems using large amounts of data. Here, these 
terms are meant as follows: 


e efficient: the austere use of “resources” such as computation time and com- 
puter memory; 


e continuous: the scalar field is, as in mathematical analysis, IR or C. 


1.2 There is, however, a fundamental discrepancy between the worlds of continu- 
ous and discrete mathematics: ultimately, the results of analytical limits require 
infinitely long computation time as well as infinite amounts of memory resources. 
In order to calculate the continuous problems efficiently, we have to suitably dis- 
cretize continuous quantities, therefore making them finite. Our tools for this task 
are machine numbers, iteration and approximation. This means that we purposefully 
allow the numerical results to differ from the “exact” solutions of our imagination, 
while controlling their accuracy. The trick is to find skillful ways to work with a 
balanced and predictable set of imprecisions in order to achieve greater efficiency. 


1.3 Numerical linear algebra is a fundamental discipline. The study of numerical 
linear algebra not only teaches the thought process required for numerical analysis, 
but also introduces the student to problems that are omnipresent in modern 
scientific computing. Were one unable to efficiently solve these fundamental 
problems, the quest for solving higher-ranking ones would be futile. 


© Springer International Publishing AG 2018 1 
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2 Matrix Calculus 


2.1 The language of numerical linear algebra is matrix calculus in IR and C. We 
will see that this is not merely a conceptional notation, but also a great help when 
striving to find efficient algorithms. By thinking in vectors and matrices instead 
of arrays of numbers, we greatly improve the usability of the subject matter. To 
this end, we will review a few topics from linear algebra, and adapt perspective 
as well as notation to our purposes. 


2.2 For ease of notation, we denote by K the field of either real numbers IR or 
complex numbers C. In both cases, 


K"*" = the vector space of m x n matrices. 


Furthermore, we identify K” = K"™! as column vectors, and K = K!*' as scalars, 
thereby including both in our matrix definition. We will refer to matrices in K!x™ 
as co-vectors or row vectors. 


2.3 In order to ease the reading of matrix calculus expressions, we will agree 
upon the following concise notation convention for this book: 


e a,B,Y,...,W: scalars 
e a,b,c,...,Z: vectors (= column vectors) 


e a',b'ic,...,2': 


co-vectors (= row vectors) 
e A,B,C,...,Z: matrices. 
Special cases include: 


e k,j,l,m,n,p: indices and dimensions with values in No. 


Remark. The notation for row vectors is compatible with the notation from §2.5 for the 
adjunction of matrices and vectors. This will prove itself useful in the course of this book. 


2.4 We can write a vector x € IK” or a matrix A € K”*” using their components 
in the following form: 


C1 My +++) Win 
x=] 2 | = (Cj)jatem A=]| : | = (jk) j=t:mk=tn- 
em Xm +'° &mn 
In this case, j] = 1: m is the short form of j = 1,2,...,m. This colon notation is 


widely used in numerical linear algebra and is supported by numerical program- 
ming languages such as Fortran 90/95, MATLAB and Julia. 
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2.5 The adjoint matrix A’ € IK"*™ of the matrix A € K"*" is defined as 


/ 


/ 
Ay tt Bin 


! ! 

a= = (jk )k=A:n,j=1:m 
! ! 

Min °° kinn 


with a’ = « for real scalars and a’ = & for complex scalars.! In the real case, one 
refers to a transposed matrix, while in the complex case we use the term Hermitian 
transposed Matrix. This way, the row vector 


x! = (C1,.--/ Em) 
is the adjoint co-vector of the vector x = (¢j)j=1:m- 


2.6 Instead of diassembling a matrix into its components, we will often partition 
matrices A € K”%*" into their column vectors a‘ € K™ (k = 1:n) 


A= qi... gt 


or row vectors a € KX" (j = 1: m) 
/ 
A= 
oo 


The process of adjunction swaps rows and columns: the columns of A’ are there- 
fore the vectors a; (j = 1: m), and the rows are the co-vectors (a*)’ (k = 1: n). 


Remark. Memory aid: the superscript indexes correspond to the “standing” vectors (column 
vectors), while the subscript indexes correspond to “lying” vectors (row vectors). 


2.7 The standard basis of K™ consists of the canonical unit vectors 
= (fF =h) ie (k=1:m), 
where we employ the practical syntax of the Iverson bracket:2 


[A] = 1 if the Statement A is correct, 


0 otherwise. 


1The ‘-notation for adjunction has been adopted from programming languages such as MATLAB 
and Julia, but can nevertheless also be found in the German functional analysis literature. 

?The Iverson-bracket is much more versatile than the Kronecker delta and deserves much wider 
recognition. A multi-faceted and virtuous exposure can be found in the classic textbook R. Graham, 
D. Knuth, O. Patashnik, Concrete Mathematics, 2nd ed., Addison Wesley, Reading, 1994. 
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The column vectors a‘ of a matrix A are defined as the image of the canonical unit 
vector e* under the linear map induced by A: 


a* = Ae (k=1:n). (2.1) 


2.8 Linearity therefore ensures that the image of the vector x = (€x)ket:n = 
yi, mek € K" is 
n 
K™ > Ax = Vo gal; 
k=1 
instead of using the term image, we will mostly refer to the matrix-vector product. 
In the special case of a co-vector y' = (1},---,1];,), we obtain 


n 


K > yx =) om Ski (2.2) 
k=1 
this expression is called the inner product of the vectors y and x.> If we then 
continue to read the matrix vector product row-wise, we attain 


Ax = 


ai, x 


2.9 The inner product of a vector x € K” with itself, in accordance with (2.2), 
fulfills 


n 


n 
ake CoS) Pou 
k=1 


k=1 
We can directly see that x'x = 0 = x = 0. The Euclidean norm we are familiar with 
from calculus is defined as ||x||2 = v’x'x. 


2.10 The product C = AB € K"™*? of two matrices A € K”*" and B € K"*? is 
defined as the composition of the corresponding linear maps. It therefore follows 
ck — Cek — (AB)e* = A(Be*) = Abk (k=1:p), 

thus with the results on the matrix-vector product 
| | a,b} --- al bP —a',B— 
AB @ Ab! ... AbP ©) : : 2) : . (2.3) 


| | a,b1 ++. al, bP —ai,,B— 


— 
= 


The last equality arises from the fact that we read the result row-wise. In particular, 
we see that the product Ax provides the same result, independent of whether 
we view x as a vector or an n x 1-matrix (therefore being consistent with the 
identification agreed upon in §2.2). 


3One often writes y’x = y - x; therefore the secondary name dot product. 
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2.11 Since adjunction is a linear involution* on K, the inner product (2.2) satisfies 
the relationship 


(y'x)! = x'y. 


The second formula in (2.3) leads directly to the adjunction rule 


(AB)! = BIA’. 
2.12 The case of xy’ € K"*" is called outer product of the vectors x € K”, y € K", 


/ 


iy, = * Gye, 
xy = : : 


Cm 1 oy Cat 


The outer product of x,y # 0 is a rank-1 matrix, whose image is actually spanned 
by the vector x: 
(xy)z=(y'z)-x (ze K"). 


EK 


2.13 For x € K” we will define the associated diagonal matrix by? 
o1 
62 
diag(x) = 
Gm 


Notice that diag : K” > K"*" is a linear map. With diag(e“) = ek - e, (where we 
set e, = e“) we therefore obtain the basis representation 


diag(x) = J de (ee) 
=1 


2.14 The unit matrix (identity) I €¢ K”*" is given by 


m 
I= > es (2.4) 
*. k=1 


4Involution on K: @” = @ for all € € K. 
5For such a componentwise matrix notation, we will agree that, for the sake of clarity, zero entries of 
a matrix may simply be omitted from printing. 
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It fulfills Te, = e, (k = 1: m) and therefore (linearity) Ix = x for all x € IK”; thus 
m 
x= (ex) -e. (2.5) 


This means ¢; = e,. x (k = 1: m). After adjusting the dimensions, it notably holds 
AI = A and JA = A. Since the row vectors of I are the ei, the third formula 
in (2.3) provides a formula dual to (2.1) 


aA=e4A (k=1:m). 
2.15 Due to AB = AIB for A € K”*" and B € K"*?, it follows from (2.4) 


n n 
AB= Y" Ack - eB = Yak - by, (2.6) 
k=1 k=1 


that is, a matrix product can be represented as sum of rank-1 matrices. 


2.16 There are a total of four formulas in (2.3.a-c) and (2.6) for the product of 
two matrices. One should realize that every one of these formulas, when written 
out componentwise, delivers the same formula as one is familiar with from 
introductory linear algebra courses (the components of A, B and C = AB thereby 
defined as a jx, Bxi, Yjl): 


Yjl = & KB G = 7; m,1 = 13 p). (2.7) 
k=1 


This componentwise formula is by far not as important as the others. Notice that 
we have, without much calculation and independent of this equation, conceptually 
derived the other formulas. 


2.17 All of the expressions for C = AB until now are actually special cases of a 
very general formula for the product of two block matrices: 


Lemma. If we partition A € K"*" and B € K"*? as block matrices 


An «+: At By --- Bis 


gic" Agr By ++ Bys 


with the submatrices Aj € IK™*"*, By € IK"**?P! where 


q r 8 
n=) im, n=)i nm, p= yop 
j=l k=1 I=1 
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then the product C = AB € K""“? is partitioned into blocks as follows: 


Ci ++ Cts : 
c=|: : wih Cye ) Agbe GSligi=1is). 28) 
Ca ae Ge k=l 


Remark. A comparison of the product formulas (2.7) and (2.8) shows that with such block 
partitioning, one can formally perform calculations “as if” the blocks were scalars. In the 
case of block matrices, one must pay careful attention to the correct order of the factors: in 
contrast to the componentwise fx) 0jx, the blockwise expression Bj; Ajx is actually almost 
always false and often meaningless since the dimensions do not necessarily match. 


Proof. In order to prove (2.8) we will partition the vectors x € IK” according to 
x=]: ], x EK" (k=1:7), 


and we consider the linear map N;, € K"*" defined by 
Nyx = Xk. 


Left multiplication with N; therefore achieves the selection of the block rows 
that belong to n,;; adjunction shows that right multiplication with Nj delivers the 
corresponding block columns. Similarly, we define Mj € K"/*" and P; € K?!*?. 
It therefore holds that 


Cy = MCP, Ajk = Mj AN, Bu = N, BP). 


The asserted formula (2.8) for block multiplication is therefore equivalent to 


‘ 
M,CP; = MA e Nin) BP. Geli iss) 
k=1 
and it thus follows from C = AB if only 


r 
yo NN, = Ie K"™". (*) 
k=1 


For x,y € K” with block parts x, = Nzx, ye = Ny this equality is successfully 
“tested” by executing the sum for the inner product blockwise, 


r if 
ee ea ee ee (x Nin) i 
k=1 k=1 


If we now let x and y traverse the standard basis of IK”, we finally obtain (*). 


Exercise. Show that the five formulas in (2.3), (2.6), (2.7) as well as C = AB itself are special 
cases of the block product (2.8). Visualize the corresponding block partitioning. 
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3 MATLAB 


3.1 MATLAB (MATrix LABoratory) is the brand name of a commercial software 
that is prominent in both in academic and industrial settings and widely used 
for numerical simulations, data acquisition, and data analysis.® By means of a 
simple scripting language, MATLAB offers an elegant interface to matrix-based 
numerical analysis, as well as state of the art performance by levering the BLAS 
Library (Basic Linear Algebra Subprograms) from the processor manufacturer 
along with the high-performance Fortran library LAPACK for linear systems of 
equations, normal forms and eigenvalue problems. 


MATLAB has become a standard skill that is currently required of every math- 
ematician and engineer. Learning this skill though concrete practical problems 
is strongly recommended, which is why we will be programming all of the algo- 
rithms in the book in MATLAB. A very short introduction (for readers with a little 
previous programming experience) can be found in Appendix A. 


3.2 In MATLAB scalars and vectors are both matrices; the identifications 
K” — 1 K=k*! 


are therefore design principles of the language. The fundamental operations of 
matrix calculus are: 


meaning formula MATLAB 
component of x Ck x(k) 
component of A Mik AG.k) 
column vector of A. ak AC:,k) 

row vector of A a; A(j,:) 
submatrix of A (Wie) miep kel A(m:p,n:1) 
adjoint matrix of A A’ WP? 

matrix product AB A*B 
identity matrix LEekrt eye(m) 
null matrix Oe k™x" zeros(m,n) 


6Current open-source-alternatives include Julia, which is extensively discussed in Appendix B, and 
the Python libraries NumPy and SciPy, see H. P. Langtangen: A Primer on Scientific Programming with 
Python, 5th ed., Springer-Verlag, Berlin, 2016. 
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3.3 In order to practice, let us write all five formulas (2.3.a-c), (2.6) and (2.7) for 
the product C = AB of the matrices A € K"™", B € K”*? as a MATLAB program: 


Program | (Matrix Product: column-wise). 


1 © = zeros(m,p); | | 

2 Bonet se-sp) C= Ab! : Ab? 
GCs 5ib) —@ Meas . aL) 8 

4 end | | 


Program 2 (Matrix Product: row-wise). 


1 © = zeros(m,p); —a,B— 
2 Lory = am! 

OCj 58) S AG 5 s)2238 i 
4 end i 


Program 3 (Matrix Product: inner product). 


1 C = zeros(m,p); 

> for j=t:m abl ++ ah bP 
aioe) ILESSE Se) 

1 C(j,1) = AGj,:) *BC: 1); : : 
end a,b! --- al bP 

6 end 


Program 4 (Matrix Product: outer product). 


1 © = zeros(m,p); 


n 
2 for k=1:n be. k . / 
3 C=C + AC:,k)*B(k,:); C=)ia bk 


4 end 


Program 5 (Matrix Product: componentwise). 


1 © = zeros(m,p); 
2 ROW ee ees 
for (E=1sp 


n 
4 for k=1:n 
5 C(j,1) = C(j,1) + ACGG,k)*B(K,1); = (3 sae) 
é j=lim) 


end 
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3.4 When we apply these programs to two random matrices A,B € JR1000% 1000 | 
we measure the following execution times (in seconds):’ 

program #for-loops MATLAB [s] C & BLAS [s] 

A*B 0 0.031 0.029 

column-wise 1 0.47 0.45 

row-wise 1 0.52 0.49 

outer product 1 3.5 0.75 

inner product 2 1.7 1.5 

componentwise 3 20 1.6 


To compare with a compiled language (i.e., translated into machine code), we 
have also executed the same programs implemented in C while still using the 
same optimized Fortran BLAS routines used by MATLAB. 

We can observe discrepancies in execution times of a factor of up to 600 in 
MATLAB code and up to 60 in C & BLAS, which we will try to explain in §4 
in order to better understand where they come from before moving on to other 
numerical problems. 

Remark. Check for yourself that all six programs really execute the exact same additions 
and multiplications, only in a different order. 


3.5 We can already observe one design pattern in use: the fewer for-loops, 
the faster the program. If it is possible to express an algorithm with matrix- 
matrix operations, it is faster than just working with matrix-vector operations; 
matrix-vector operations on the other hand are advantageous when compared to 
componentwise operations: a higher level of abstraction is actually an asset. 


4 Execution Times 


4.1 Cost factors for the execution time of a program in numerical analysis include: 
e floating point operations (i.e., the real arithmetic operations: +, —, -, /, ne ) 
e memory access 
e overhead (unaccounted operations and memory access) 


4.2 In an ideal world, the floating point operations (flop) would be the only cost 
factor of the execution time of a program and it would suffice to count them.® 


The execution times were measured on a MacBook Pro 13” with 3.0 GHz Intel Core i7 processor. 
8For K = C we must multiply the flop count of K = IR with an average factor of four: complex 
multiplication costs actually six real flop and complex addition two real flop. 
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problem dimension #flop #flop (m=n =p) 
x'y x,y € R” 2m 2m 
xy! x€ Ry eR" mn m- 
Ax AER™" xER" 2mn 2m? 
AB AeER™", BER"? 2mnp 2m? 


As in this table, we will only consider the leading order for growing dimensions: 


Example. As displayed in (2.2), the inner product in R” requires m multiplications 
and m — 1 additions, that is 2m — 1 operations in total; the leading order is 2m. 


4.3 If a single floating point operation costs us one unit of time top, the peak 
execution time (peak performance) is 


Tpeak = # flop 7 Flop 


which defines an upper limit of algorithmic performance. The ambition of nu- 
merical mathematicians, computer scientists and computer manufacturers alike, 
at least for high-dimensional problems, is to come as close as possible to peak 
performance by minimizing the time needed for memory access and overhead. 

Example. The CPU used in §3.4 has a peak performance of approximately 72 Gflop 
per second. This means, Tpeak = 0.028 for 2- 10° flop of the matrix multiplication 
where m = 1000; the actual execution time for program A*B was merely 4% slower. 


4.4 Modern computer architectures use pipelining in vector processors. Let us ex- 
amine the concepts with the help of the basic example of addition. 

Example. In order to add two floating point numbers, we must sequentially execute 
the steps in the following scheme (cf. §11.6): 


¢ — adjust — add normalize 
; <7 = $= or] 
4y = =— | exponents | — | mantissas exponent 


In this case, the processor needs 3 clock cycles for the addition. If we now add two 
m-dimensional floating-point vectors componentwise in this same way, we will 
need 3m clock cycles. Instead, let us now carry out the same operations using a 
kind of assembly line (pipeline). In this way, operations are executed simultaneously, 
and every work station completes its operation on the next component: 


adjust add normalize 
10 —? 9 —_ 8 
S10 6 6 =|) Gy — b 
0 —? 19 =f 118 


Thus, vectors of length m can be added in m+ 2 clock cycles and pipelining 
therefore accelerates the addition for large m by almost a factor of three. 
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4.5 In principle, memory access costs time. There are very fast memory chips, that 
are very costly, and slow memory chips that are more affordable. For this reason, 
modern computers work with a hierarchy of memories of different speeds that 
consist of a lot of inexpensive slow memory and a lesser amount of fast memory. 
Ordered from fast to slow, typical sizes for modern devices with a 3GHz CPU are: 


memory type typical size 
CPU register 128 B 


L1 cache 32 KiB 

L2 cache 256 KiB 
L3 cache 8 MiB 
RAM 8 GiB 

SSD 512 GiB 
HD 3 TiB 


The access speeds range from 20 GB/s to 100 MB/s, which correspond a difference 
of 2-3 in order of magnitude. 


4.6 By cleverly coordinating the memory access pattern, one can ensure that a 
vector processor always has its next operand in the pipeline and only has to input 
and output that operand once. We will label the number of memory accesses to 
main memory (RAM) for input and output as #iop and the time needed for the 
access as tiop. We can therefore write our optimized execution time 


; T 
T = #flop - top + #iops - tiop = Tpeak (1 + = , 
where the machine dependent measure T = ftiop/tflop is T ~ 30 for modern 
computer architectures, and the algorithm dependent efficiency ratio 


__ #flop 
~ #iop 


= flop per input-output operation. 


We therefore obtain in leading order (all dimensions = m): 


operation #flop #iop q 
x'y inner product 2m 2m 1 
uy outer product m mn? 1 
Ax matrix-vector product 2m? m 2. 
AB matrix-matrix product 2m 3m? 2m/3 
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Example. Now, we can quantitatively explain a number of differences from our table 
in §3.4: with q = 2000/3 and t 30, the problem A *B should only be 4% slower 
than peak performance, which, according to §4.3, is sure enough the case. The 
row-wise and column-wise versions should be a factor of 15 slower, which has 
also been shown to be true. Furthermore, outer product calculations are slower 
by a factor of approximately 30 (which is the case for C & BLAS). MATLAB 
already experiences significant overhead for the outer product, and in general for 
calculations that only use vectors or components. 


4.7 The BLAS Library was developed to encapsulate all hardware specific opti- 
mizations, and create a standard interface with which they could be accessed:? 

e Level-1 BLAS (1973): vector operations, replacing one for-loop. 

e Level-2 BLAS (1988): matrix-vector operations, replacing two for-loops. 


e Level-3 BLAS (1990): matrix-matrix operations, replacing three for-loops. 


Example. A selection of important BLAS routines (q = #flop/# iop):'° 


BLAS level name operation q 
1 xAXPY ytoaxt+y scaled addition 2/3 
1 xDOT at x'y inner product 1 
2 xGER A+< axy’+A __ outer product 3/2 
2 xGEMV y+ aAx+ fy matrix-vector product 2 
3 xGEMM C+ aAB+ fC matrix-matrix product m/2 


The notable efficiency gains of Level-3 BLAS are thanks to an efficiency ratio q 
that, as a matter of principle, scales linearly with the dimension m. 


MATLAB includes optimized BLAS for all popular platforms and encapsulates it 
in a scripting language that is very close to mathematical syntax. When we look 
at code variations we should always formulate our algorithms using the highest 
BLAS level: a higher level of abstraction is actually an asset. 


4.8 It is not coincidence that the row-wise multiplication from §3.4 is approxi- 
mately 10% slower than the column-wise variant. The programming language 


° Optimized implementations are delivered either by the CPU manufacturer or the ATLAS Project. 
The letter x in the names stands for either S, D, C or Z, meaning either single or double precision for 
both real and complex machine numbers (we will be discussing this topic in more detail in §12.4). 
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Fortran (and therefore BLAS, LAPACK and MATLAB) stores a matrix 


aq M10 
A= : 
m1 °°" &mn 
column-wise (column-major order): 
&q1 °° m1 | 412° °° &m2 ee Xin + &mn 


Therefore, unlike row-wise operations, operations on columns require no index 
calculations which would amount for further overhead; column-wise operations 
are therefore preferable. In contrast, C and Python store matrices row-wise (row- 
major order): 


M11°°*&1n | 421°" + &2n ws Xm1°** &mn 


In this case, row-wise algorithms should be used. 


4.9 The execution time of an algorithm is also influenced by the type of program- 
ming language used; the following factors play a role in slowing it down:!! 


e Assembler = Machine code (hardware dependent): 1x 


Compiler A 
e Fortran, C, C++ aompess machine code: 1x — 2x 


e MATLAB, Python, Julia ay Bytecode for virtual machine: 1x — 10x 


Compiler 


Optimizing compilers replace vector operations with Level-1 BLAS routines. 


5 Triangular Matrices 


5.1! Triangular matrices are important building blocks of numerical linear algebra. 
We define Jower and upper triangular matrices by the occupancy structure 


Components omitted from printing—as agreed upon in §2.13—represent zeros, ’*’ 
represent arbitrary elements in K. 


MIT = just in time 
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5.2 With V, = span{e,,...,e,} C K™, one can characterize such matrices more 
abstractly: U € K”*” is an upper triangular matrix if and only if!? 


UV, = span{ul,...,uk} C Vp (k=1:m), 
and L € K”*" is a lower triangular matrix if and only if 
VL = span{lj,...,4} CV (k=1:m). 
Remark. Lower triangular matrices are the adjoints of upper triangular matrices. 
5.3 Invertible lower (upper) triangular matrices are closed under multiplication 
and inversion. This means it holds: 


Lemma. Invertible lower (upper) triangular matrices in K"*™ form a subgroup in 
GL(m;K) relative to matrix multiplication.'$ 


Proof. For an invertible upper triangular matrix, it holds that UV, C V; and due to 
dimensional reasons even that UV; = V;. Therefore, U-'V, = Ve (k = 1: m) and 
U-! is upper triangular as well. Accordingly, if U;, U2 are upper triangular, 


U,zUnVv, C ULV, C Vi (k=1 :m). 


This means that UU is also upper triangular. Through adjunction, we arrive at 
the assertion for lower triangular matrices. 


5.4 Building on Laplace’s expansion it follows that 


My 
* Ad A2 

det] . ee =A,det} : °., See SA Am; 
* ok ree An 7 Am 


that is, the determinant of a triangular matrix is the product of its diagonal entries: 
A triangular matrix is invertible iff all its diagonal entries are non-zero. 


Triangular matrices whose diagonals only exhibit the value 1 are called wnipotent. 


Exercise. Show that the unipotent lower (upper) triangular matrices form a subgroup 
of GL(m;K). Hint: Make use of the fact that unipotent upper triangular matrices U are 
characterized by 

U=I+N, NY.CV_1 (k=1:m). 
Due to N™ = 0, such a matrix N is nilpotent. (Or, alternatively, prove the assertion through 
induction over the dimension by partitioning as in §5.5.) 


Here u!,...,u" are the columns of U, I},...,l/, are the rows of L and Vi = span{e/,...,e}. 
The general linear group GL(m;IK) consist of the invertible matrices in K™*™ under matrix 
8 group 
multiplication. 
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5.5 We now solve linear systems of equations with triangular matrices, such as 
Lx =b 


with a given vector b and an invertible lower triangular matrix L. We therefore set 
Lin = L, bm = b and xm = x and partition step by step according to 


Ly-1 a Xk-1 : D1 
Ly — E K 7 , Xk = E K , by 


! eK, 
Ty | Ak Ck Bx 


The two block-rows of the equation Lyx, = b; can by expanded to 


Ly—1Xp—-1 = be-1, Tey Xk—1 + AnGk = Br. 


The first equality means only that we are consistent in our labeling; the second can 
be solved for ¢, via the previous quantity x,_ and thus provides the transition 


Xk—1 +> Xp. 


The process begins with an empty" vector Xp (i.e., it does not actually appear in 
the partitioning of x,) and leads to the solution of x = x after m steps (dividing 
by Ax is allowed since L is invertible): 


Ck = (Be—MaXk-1)/Ak (k= 1: m). 


This remarkably simple algorithm is called forward substitution. 


Exercise. Formulate the back substitution algorithm to find the solution of Ux = b with an 
invertible upper triangular matrix U. 


5.6 The corresponding MATLAB program is: 
Program 6 (Forward Substitution to Solve Lx = b). 


x = zeros(m,1); 


2 for k=1:m 


x(k) = (b(k) - L(k,1:k-1)*x(1:k-1))/L(k,k); 
end 


The algorithm is realized with one for-loop over inner products; in practice one 
should employ an optimized Level-2 BLAS routine. 


M4Empty parts in a partitioning lead to convenient base clauses for induction proofs and initializations 
for algorithms; inner products of empty parts (i.e., zero-dimensional vectors) equal zero. 
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5.7 The inner products require 2k flop in leading order. The computational cost 
of the entire loop is therefore! 


m 
#flop =2 ) > k =m’. 
k=1 


The associated memory accesses for input and output are dominated by accessing 
the triangular matrix (in leading order m?/2 elements) such that 


q =#flop/#iop = 2. 


Remark. This means the computing time and the efficiency ratio q are equal (in leading 
order) to those of a matrix-vector multiplication with a triangular matrix. 


Forward and back substitution are standardized as Level-2 BLAS routines: 


BLAS level name operation #flop q 
x< Lx . ee ots 
2 xTRMV matrix-vector multiplication m2 2 
x< Rx 
xe Lox forward substitution 2 
2 xTRSV m 2 
x< Ro x back substitution 


The MATLAB commands to solve Lx = b and Ux = b are (MATLAB analyzes the 
matrix and calls xTRSV for triangular matrices): 


me lin, s = Ws 


5.8 As soon as ¢; is calculated, the components {;, are no longer needed for 
forward and back substitution; the memory from /, can therefore be used for ¢,: 


Bi Bo Bs ad Bm = (a Bo Bs foes Bin = 61 62 Bs aed Bm 


Sone ae C1 c2 &3 ‘<r om 
One refers to an algorithm that overwrites part of the input data with the output 
data as in situ (in place) execution. 
Like every other routine in the BLAS Library, xTRSV works in situ; the in situ 
version of the MATLAB program from §5.6 can be seen in the following code 
snippet: 


Program 7 (Forward Substitution for x + L~!x). 


for k=1i:m 
x(k) = (x(k) - L(k,1:k-1)*x(41:k-1))/L(k,k); 


3 end 


15’ represents equality in leading order. 
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6 Unitary Matrices 


6.1 Along with triangular matrices, there is another important group of matrices 
that belongs to the building blocks of numerical linear algebra. We call Q € K’"*™”" 
unitary (for K = R sometimes also orthogonal) if 


Q'=Q’ 
or, equivalently, 
QO=00'=1 
The solution of a linear system Qx = b is then simply x = Q’b; the computational 
cost of 2m? flop is two times larger than that of a triangular system.!¢ 
6.2 As with triangular matrices, unitary matrices form a group: 


Lemma. The unitary matrices in K"*™ form a subgroup of GL(m;IK) under matrix 
multiplication; this subgroup is denoted by U(m) for IK = C and by O(m) for K = R. 


Proof. Due to Q” = Q, if Q is unitary, the inverse Q~! = Q' is also unitary. For 
unitary matrices Qj, Q» it follows from 


(Q1Q>)'QiQ2 = Q5Q}Q1Q2 = Q51Q2 = Q)Q2 = 1 


that the product QQ is also unitary. 


6.3 The adjoints of the column vectors of Q are the row vectors of Q’: 
| | 4h 
Q=|m--- Gm|, Q= : 
bo =n — 
Therefore, the matrix product Q’Q = I as defined by (2.3.b) provides 
qn = i=) (j=1:m,l=1:m); 


Vector systems with this property form an orthonormal basis of IK". 


Remark. A rectangular matrix Q € K™*" with Q’Q = I € K"*" is called column-orthonormal 
since its columns form an orthonormal system. It this case we have n < m (why?). 


16Yet, the efficiency ratio remains q = 2. 
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6.4 If we write QQ’ = I using outer products as in (2.6), we obtain the following 
generalization of (2.4) for an orthonormal basis: 


T= )o ag 
k=1 


When applied to the vector x € IK”, this equation provides its expansion into the 
orthonormal basis as given by (2.5): 


The components of x with respect to qx are thus q/.x, they form the vector Q’x. 


Permutation Matrices 


6.5 The operation of swapping the columns ak of a matrix A € K"*™ according 
toa permutation!” 7t € Sm corresponds to a linear column operation, that is, a 
multiplication with some matrix P; from the right (such column operations affect 
every row in the same manner and are therefore structured like (2.3.c)): 


Since the columns of P; form an orthonormal basis, Pz is unitary. 


Exercise. Show that 77 ++ P, provides a group monomorphism Sm — U(m) (Sm — O(m)); the 
permutation matrices thus form a subgroup of GL(m;K) that is isomorphic to Sin. 


6.6 Accordingly, the adjoint P!, = P7' = P_1 swaps the rows a, of the matrix 
A € K"*" by means of multiplication from the left: 


— —e — 


— Fn(1) 
PLA = : , Le, Ph = PI = 
_— a (m) —_ _ ae, ~~ 


1 


A transposition T satisfies T ' = t and therefore P/. = P,. 


6.7 A permutation 7 € S,, is encoded as p = [7(1),...,7¢(m)] in MATLAB. In 
this way, the row and column permutations P!.A and AP, can be expressed as: 


IMG 58) 5 EMCS 512d) 
One therefore obtains the permutation matrix P, as follows: 


I = eye(m); P = I(:,p); 


"The group of all permutations of the set of indexes {1,2,...,m} is the symmetric group Sn. 


Il Matrix Factorization 


Although matrix factorization is not a new 
subject, | have found no evidence that is has 
been utilized so directly in the problem of 
solving matrix equations. 


(Paul Dwyer 1944) 


We may therefore interpret the elimination 
method as the combination of two tricks: First, 
it decomposes A into a product of two 
triangular matrices. Second, it forms the 
solution by a simple, explicit, inductive process. 


(John von Neumann, Herman Goldstine 1947) 


7 Triangular Decomposition 


7.1 The two “tricks” of the above 1947 von Neumann and Goldstine quote for 
solving a linear system of equations Ax = b can more formally be written as: 
(1) Factorize A into invertible lower and upper triangular matrices if possible, 
A= LU. 
We refer to this process as the triangular decomposition of A. 
(2) Calculate x by means of one forward and one back substitution 
Le= 0, Ux = z. 
It follows that b = L(Ux) = Ax. 


We normalize such a triangular decomposition by specifying L as unipotent. 


Remark. The elimination method one might be familiar with can be expressed as a combi- 
nation of both of the “tricks” listed above. More often than not, this method is attributed to 
Carl Friedrich Gauss, who spoke of eliminatio vulgaris (lat.: common elimination) in 1809, 
but did not himself invent the method.!8 We will use the term (Gaussian) elimination as a 
synonym of triangular decomposition. 


187, F. Grear: Mathematicians of Gaussian Elimination, Notices Amer. Math. Soc. 58, 782-792, 2011. 


© Springer International Publishing AG 2018 21 
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7.2 As it were, not every invertible matrix has a normalized triangular decompo- 
sition (we will be addressing this issue in §7.9), but nevertheless, if one exists, it is 
uniquely defined. 


Lemma. [f for a given A € GL(m;K) there exists a normalized triangular decomposi- 
tion A = LU (where L is unipotent lower triangular and U invertible upper triangular), 
then both factors are uniquely defined. 


Proof. Given two such factorizations A = L,U; = LyUp, then due to the group 
properties of (unipotent) triangular matrices, the matrix 


Ly ky = UaU;* 


is simultaneously unipotent lower triangular as well as upper triangular, and must 
therefore be the identity matrix. Thus L; = Lz and Uy; = Up. 


7.3 To calculate the normalized triangular decomposition A = LU of A € GL(m;K 
we let Aj = A, Li = L, U; = U and partition recursively according to 


Xk Uy 1 Xk u 
Ak = , Ly = , Uk = , (7.1a) 
by | By Ik | Lipa Uk+1 


whereby always 
Ay = Ly Uk. 


In the kth step, the row («,, u'.) of U and the column I; of L are calculated; in 
doing so, the dimension is reduced by one: 


artition Iculat 
Ar eaters Xk, UL, by, By SEE lk, Akit (k = 14 m). 
(7.1a) —S—’ (7.1b) & (7.1c) 


auxiliary quantities 


The resulting A,,1 provides the input for the following step. 
When we perform the multiplication for the second block row of Ay = L,Uk 
(the first row is identical by design), we obtain 


by = Tyo, = By = Uy, + Leg Uy - 
See are, 


=Apai 
For x, 4 0, we can simply solve for J, and Ax, 1 and are already done:!? 

I = Dy / tg, (7.1b) 

Agut = Be - [put (7.1¢c) 


Only in (7.1b) and (7.1c) we do actual calculations; (7.1a) is merely “book keeping”. 
Since #1,...,&m make up the diagonal of U, we have additionally proven: 


The matrix Ax+1 is called the Schur complement of a, in Ax. 
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Lemma. A matrix A € GL(m;K) has a normalized triangular decomposition if and 
only if all so-called pivot elements 01,..., 0m are non-zero. 


7.4 The normalized triangular decomposition can be executed in situ by overwrit- 
ing b, with |, and B, with A;,1. In the end, the location in memory that initially 
contained A has been overwritten by 


ay uy 


a us 


ln-1 Xin 


From this compact memory scheme we subsequently obtain both factors in the form 


1 ay uy 


Ip os “| U4 


In—-1 1 Xin 


7.5 In MATLAB, this kind of in situ execution of (7.1a)-(7.1c) can be completed 
with the help of the following commands: 


Program 8 (Trianglular Decomposition). 


object access in MATLAB 


XE A(k,k) 
uy, A(k,k+1:m) 
br, Ip A(k+1:m,k) 


By, Agy1 A(k+1i:m,k+1:m) 


1 for k=1:m 
A(Ck+i:m,k) = A(Ck+1i:m,k)/A(k,k); Vi iC ies altey9) 
ACk+1i:m,k+i:m) = A(k+1i:m,k+i:m) - A(k+i:m,k)*A(k,kti:m); % (7.1c) 
1 end 
Since the last loop iteration (k = m) only finds empty (zero dimensional) objects, we can 
just as well end the loop at k = m—1. 
The reconstruction of L and U is achieved with the help of the following commands: 
5 L = tril(A,-1) + eye(m); 
6 U = triu(A); 


Due to the -1 in tril(A,-1), only elements below the diagonal are read. 
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7.6 The number”? of floating point operations needed for triangular decomposi- 
tion is dictated in leading order by the rank 1 operation in (7.1c): 


# flop for the calculation of Ay 4, = 2(m —k)?. 


Therefore the total cost is 


m—1 


ye P= Zk 
k=1 3 


m 
# flop for LU-Factorization = 2 )*(m— ke SQ 
k=1 
This cost of triangular decomposition is furthermore the dominating factor in 
solving a linear system of equations (the subsequent substitutions for the second 
“trick” from §7.1 only require 2m? flop). 
Since A is only read once, and subsequently overwritten once by the compact 
memory access scheme for L and U, only 2m? memory accesses have to be carried 
out for input and output, leading to an efficiency ratio of 


q =#flop/#iop = _ 
Due to this linear dimensional growth, a memory access optimized Level-3 BLAS 
implementation can achieve near peak performance for large m. 


Exercise. Show that the cost for LU factorization is the same as for the associated multipli- 
cation: even the product LU of a given lower triangular matrix L with an upper triangular 
matrix U requires a leading order of 2m°/3 flop with an efficiency ratio of q = m/3. 


7.7 The approach to solving linear systems of equations Ax = b with the help 
of triangular decomposition as described in §7.1 offers many advantages; here are 
two examples: 


Example. Given n and the right-hand-sides by,...,b, € IK, we want to calculate 
the solutions x1,...,X, € K"™. From these vectors we form the matrices 


B=([b--) by |) EK™", X=] x1 +++ x) eK™ 
| | | | 


and calculate 
(1) the triangular decomposition A = LU (cost = 2m?/3 flop); 


(2) with the matrix version (Level-3-BLAS: xTRSM) of the forward and back sub- 
stitution solutions Z and X of (cost = 2n - m? flop) 


LZ=B, UX = Z. 


20We assume K = R by default; for IK = C we have to multiply, on average, with a factor of four. 
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For n < m the cost of the triangular decomposition outweighs the others. An 
example application will be discussed in §20.6. 


Example. For the simultaneous solution of Ax = b and A'y = c we calculate 
(1) triangular decomposition A = LU (and thereby, automatically, A’ = U'L'); 
(2) by forward and back substitution the solutions z, x, w and y of 
Lz=b, Ux =z, U'w=c, Liv 


7.8 As stated in §7.3, as soon as a; = 0 for just one pivot element, a (normalized) 
triangular decomposition of A does no longer exist. For example, the invertible 


matrix 
01 
s=(7 i) 


already exhibits a; = 0. This theoretical limitation is accompanied by a practical 
problem. For example, by replacing the zero in A with a small number, we get 


10-29 1 1 0 10~20 1 
A=( 1 1) = LU, L= (9 1)? u=( 0 i) 


On a computer only capable of calculations with 16 significant digits, rounding 
leads to the actual numbers (symbol for such a representation by rounding: ‘=’ 


1 — 107° = —0.99999 99999 99999 99999 .107° 
= —1,00000 00000 00000 -107° = —10°9. 


The subtraction of 1 from 1029 is therefore under the resolution threshold. Conse- 
quently, instead of U, the computer returns the rounded triangular matrix U, 


~ 10-29 1 
a-( 0 ae 


which corresponds to the triangular decomposition of 


~ a 10-29 4 
a-10-(%" 2) 


and not that of A. If we were to solve the system of equations defined by Ax = e1, 
instead of the computer correctly returning the rounded value 


ee. 


it would return the solution 
,_ [9 
#=(1) 


of At = ey, with an unacceptable error of 100% in the first component. 
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Conclusion. In general, take the “Zeroth Law of Numerical Analysis” to heart: 


If theoretical analysis struggles for « = 0, numerical analysis does so for x ~ 0. 


Triangular Decomposition with Partial Pivoting 


7.9 Once one realizes that the row numbering of a matrix is completely arbitrary, 
both problems mentioned in §7.8 can be solved with the following strategy: 


Partial Pivoting: In the first column of Ax, the first element with the 
largest absolute value in that column is selected as the pivot element «,; 
the associated row is then swapped with the first row of Ag. 


Thus, by calling % the transposition of the selected row numbers, we replace (7.1a) 
by the block partitioning (given A; = A, Li = L, U; = U) 


a, | Uy, 1 a, | uy, 
Pr, Ak = , Lk= , Up= (7.2a) 
i by | By Ik | Lesa Uk+1 
where?! 
[be] < Jax. 


We however cannot simply set FLAG = L,U,, since we must accumulate the row 
swaps of the subsequent steps in order to multiply consistently. To this end, we 
inductively introduce the permutation matrices 


1 


fo / 
= P P, (7.2b) 


and complete the triangular decomposition of the resulting matrix A; that is 
created after the successive row swaps have been completed: 


PLAg = L,Uk. (7.2c) 
By multiplying for the second row of blocks in this equation, we attain 


/ / / 
Prey = Mele, Ph Be = Tneg + Le Uk, 
+S 


—pi 
=P Aku 


which (after left multiplication with P,,1) can be simplified to 


/ 
by = XKVk, By = Ogu, + Ag+ (where OR = Prsalk). 


*1Such inequalities (and the absolute values) have to be read componentwise. 
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We have thus performed exactly the same computational steps as described in §7.3, 
namely 


Og = Og / Kk, (7.2d) 
Ak4i = By = Og Uy. (7.2e) 


The only difference between this approach and that of (7.1b) and (7.1c), is that we 
attain |; only after performing all remaining row-swaps on v;: 


Ik = Pea Dic (7.2f) 
Remark. The in situ variant of the algorithm is realized by executing the row swaps of all 


columns of the matrix in memory (i.e., for Ay as well as v1,..., 0x). 


7.10 The in situ execution of (7.2a)-(7.2e) can be written in MATLAB as: 


Program 9 (Triangular Decomposition with Partial Pivoting). 
We calculate P’A = LU for a given A. As in §6.7, we can realize the row permutation P’A 
as A(p,:) with the vector representation p of the corresponding permutation. 


p= i:m; % initialization of the permutation 


3 for k=1:m 


1 


[~,j] = max(abs(A(k:m,k))); j = k-1+j; % pivot search 
p({k j]) = p({j k]); ACLk j],:) = AC[j k],:); % row swap 

A(k+1:m,k) = A(k+i:m,k)/A(k,k); i Shs2cl) 
ACk+i:m,kt+i:m) = A(k+i:m,kti:m) - A(kt+i:m,k)*A(k,kti:m); % (7.2e) 


s end 


Here too, the reconstruction of L and U can be completed with the following commands 


Pmt, Grecia) ce ChycHGa)) 


) U = triu(A); 


In order to reach peak performance, one should replace this program with the 
MATLAB interface to the xGETRF routine from LAPACK: 


[LRU p INR siniGiee vector): 


Without rounding errors, A(p,:) =L*U would be valid. 


7.11 As it turns out, the algorithm in §7.9 works actually for all invertible matrices: 


Theorem. Given A € GL(m;K), the triangular decomposition with partial pivoting 
results in a permutation matrix P, a unipotent lower triangular matrix L and an invert- 
ible upper triangular matrix U where 


PA=LU, |L|<1. 


Notably, all pivot elements are non-zero. 


28 Matrix Factorization [ Ch. Il 


Proof. Through induction we can show that all A; are invertible and a; 4 0. By 
assumption A; = A is invertible (base case). As our induction hypothesis, we then 
stipulate A; is also invertible. If the first column of A, were to be the null vector, 
A; could not be invertible. Therefore, the first element with the largest absolute 
value is some a; # 0 and the division for the calculation v,; in (7.2d) is valid. From 
(7.2a)-(7.2e) it directly follows 


/ / 
1 1 | nan | ub nae | ui, 


PLA, = . = 
ai —U | £ by | Bh 0 | Aes 


—Oz I 


Since the left most entry, being the product of invertible matrices, is itself an 
invertible matrix, the triangular block matrix on the right must also be invertible. 
Thus, Ax+, is invertible and the induction step is complete. 

If we set k = 1 in (7.2c) and P = P,, we can see that P’A = LU. From |bx| < |axx| 
and (7.2d) follows |v,| < 1 and with it |,| < 1, such that above all |L| < 1.7 


7.12 The methods to find a solution of the linear system of equations discussed 
in §§7.1 and 7.7 do not change at all structurally when we transition from A = LU 
to P'A = LU. In this way, Ax = b becomes P’ Ax = P’b. We must only replace b 
with P’b in the resulting formula (i.e., permute the rows of b). 

In MATLAB, the solution X € K"*” of AX = B € K"*" is therefore given by: 


Biss Wy, fol] ICA, oweCR@Ie 2) B 
Z = L\B(p,:); 


> X = U\Z; 


or equivalently, if we do not require the decomposition P’A = LU for re-use: 


X = A\B; 


8 Cholesky Decomposition 


8.1 For many important applications, from geodesy to quantum mechanics to 
statistics, the following matrices are of utmost importance: 


Definition. A matrix A ¢ K”*” with A’ = A is called self-adjoint (for K = R 
often also symmetric and for K = C hermitian). The matrix A is furthermore called 
positive definite, when 

x Ax>0 (04xEK"). 
We will refer to self-adjoint positive definite matrices by the acronym s.p.d. . 


Remark. For self-adjoint A, the term x! Ax = x’ A’x = (x'Ax)’ is always real, even for K = C. 
Positive definite matrices have a trivial kernel and are therefore invertible. 


Recall that inequalities and absolute values have to be read componentwise. 
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8.2 If we partition a matrix A € K”*” in the form 


A; | B 
A= : Ape Ke* 
c|D 


then A; is called a principal submatrix of A. Generally, at least in theory, there is no 
need to pivot in getting the triangular decomposition of A when all the principal 
submatrices A, inherit some structure of the matrix A implying invertibility. 
Exercise. Show that A € GL(m;K) has a triangular decomposition if and only if all principal 
submatrices satisfy A; € GL(k;K) (k = 1: m). Hint: Partition similarly to §8.3. 

For example, along with A, the A, are s.p.d., too. The self-adjointness of the A, 
is clear and for 0 4 x € K* it follows from the positive definiteness of A that 


/ 


x Ax |B x 
0 eal fs 0 


x'ARx = >0. 


8.3 For s.p.d. matrices there exists a special form of triangular decomposition 
which generalizes the notion of the positive square root of positive numbers: 


Theorem (Cholesky Decomposition). Every s.p.d. matrix A can be uniquely represented 
in the form 

ae ee 
whereby L is a lower triangular matrix with a positive diagonal. The factor L can be 
constructed row-wise by means of the algorithm (8.1a)-(8.1c). 


Proof. We construct the Cholesky decomposition A, = Ly Ly of the principal 
submatrices A, of A = A, step by step by partitioning 


Axg_1 | a Le-1 ; Ly Tk 
, = . (8.1a) 
a, XK L, Ax Ax 


In the kth step, the row (I/,A;) of L is calculated. In doing this, we prove by 
induction that L;, is uniquely defined as a lower triangular matrix with a positive 
diagonal. When multiplied out, Ay; = L;Lj, shows initially 


Ax 1= Lr Ly ie Lr ilk = dk, be4 = Ak, Ky + xR = dX. 


The first equation is nothing more than the factorization from step k — 1; the third 
equation is the adjoint of the second; the second and fourth one can easily be 
solved as follows: 


k= Ei jae (forward substitution), (8.1b) 


Ak = 4/ XE — i (8.1c) 
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Here, in accordance with our induction hypothesis L,_; is uniquely defined as a 
lower triangular matrix with positive diagonal, and therefore notably invertible so 
that J, in (8.1b) is also uniquely given (there is nothing to be done for k = 1). We 
must still show that 

a, — Il, > 0, 


so that the (unique) positive square root for A, can be extracted from (8.1c). For 
this, we use the positive definiteness of the principal submatrix A, from §§8.2. 


Given the solution x; from Li_1x, = —I; it actually holds that 
/ 
Xk Xk yg | Ly_-1lk Xk 
0< Ax = ( xi. 1 ) 
1 f Koi | a 1 


! ! 1 Iyl _ ! 

— x Lp Ly _4 Xr + xX Le_aly + Ly 4X +X, — XK _ LL 

eee ee eae fT 
ah ahd, =, 


Due to Ax, > 0, Ly can now be formed as uniquely given lower triangular matrix 
with positive diagonal; the induction step is complete. 


Remark. The algorithm (8.1a)-(8.1c) was developed by André-Louis Cholesky from 1905 
to 1910 for the Service géographique de l’armée and in 1924 was posthumously published 
by Major Ernest Benoit; Cholesky’s hand-written manuscript from Dec. 2, 1910 was first 
discovered in his estate in 2004. For a long time, only a small number of french geodesists 
had knowledge of this method. This changed when John Todd held a lecture on numerical 
mathematics in 1946 at the King’s College of London, thereby introducing the world to the 
Cholesky decomposition.”3 


8.4 We code the Cholesky decomposition (8.1a)—(8.1c) in MATLAB as follows: 


Program 10 (Cholesky Decomposition of A). 


L = zeros(m); 
2 for k=1:m 
Whe S Ih Cal gieail 4 ihe Teal) \ACil gtemak .2e)) 9 7 {the ali) 
InG@e sakereaily Iie? 9 Zé (ale ley) 
L(k,k) = sqrt (A(k,k) - 1k’*1k); en C8iadic)) 
6 end 


Notice that the elements of A above the diagonal are not read by the computer 
(the algorithm “knows” the symmetry). In principle, this memory could be used 
for other purposes. 


Exercise. Modify the program so that it returns an error and a vector x € K” with x'Ax <0 
if the (self-adjoint) matrix A is not positive definite. 


?3More on this in C. Brezinski, D. Tournés: André-Louis Cholesky, Birkhauser, Basel, 2014. 
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In order to reach the peak performance, one should call the MATLAB interface 
to the xPOTREF routine from LAPACK: 


L = chol(A,’lower’); 
When U = L’, then A = U'U is an alternative form of the Cholesky decomposition. 


Since the factor U can be constructed column-wise in accordance with (8.1a), its 
calculation is slightly faster in LAPACK and MATLAB than that of L (cf. §4.8): 


Ui =" ‘chiolliCA)) 


8.5 The number of floating point operations needed for a Cholesky decomposi- 
tion is dominated by the operation count of the forward substitutions in (8.1b). 
Therefore, the total computational cost is 


# flop for Cholesky decomposition = iz 
k=1 


y 


~ Pe = 13 


and is thereby (asymptotically) only half as large as the cost of a normalized 
triangular decomposition, without exploiting symmetry, as found in §7.6. 

Since only the lower half of A must be read and only the triangular factor L 
must be stored, only m* memory accesses are necessary for input and output (in 
leading order). Hence, as with triangular decomposition the efficiency ratio is 


q = #flop/#iop = a 


so that with the help of Level-3 BLAS, an memory-access-optimized implementa- 
tion can reach near peak performance for large m. 


9 QR Decomposition 
9.1 We will refer to. A € K”*" as a matrix with full column rank when its columns 
are linearly independent. Such matrices can be characterized in a number of ways: 


Lemma. A full column rank of A € IK"*" is equivalent to each of the following prop- 
erties: 


(1) rank A = dimimA=n<m, (2) ker A = {0}, (3) A’A is s.p.d. 
The matrix A' A is called the Gramian matrix of the columns of A. 


Proof. The equivalence to (1) and (2) follows directly from §2.8 and should in fact 
be well known from past introductions to linear algebra. According to §2.11, A’A 
is self-adjoint and it holds according to §2.9 that 


x'(A'A)x = (Ax)'(Ax) >0 3 (x €K"). 


Due to (Ax)/(Ax) = 0 = Ax = 0, both (2) and (3) are equivalent. 
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9.2 Our goal is to factorize a matrix A € K”*” with full column rank as 
A=QR 

with Q € K"”*" being column orthonormal and R € GL(n;K) upper triangular. 

Such a QR decomposition is said to be normalized, if the diagonal of R is positive. 


Remark. Since the columns of Q span the image of A = QR, they form, by definition, an 
orthonormal basis of the image. 


9.3 The QR decomposition of A is closely related to the Cholesky decomposition 
of the Gramian matrix A’ A: 


Theorem. Every matrix A € K"*" with full column rank has a unique normalized 
QR decomposition. The factors can be determined as follows: 

A'A = R'R (Cholesky decomposition of A’ A), 

R‘Q' =A’ (forward substitution for Q’). 
Proof. We assume that A has a normalized QR decomposition. Then, 

A'A=R'VUQR=RR 
Se 
=I 
is according to Theorem 8.3 a unique Cholesky decomposition of the Gramian 
matrix A’A, which according to Theorem 9.1 is s.p.d.. Given the upper triangular 
factor R with positive diagonal, the expression 
Q= AR"! 


is therefore also uniquely defined. By showing the column orthonormality of the 
thus defined factor Q, we can conversely ensure the existence of the QR decompo- 
sition itself: 

CUS(" AGAR Dar ik I 
with L = R’ from the Cholesky decomposition A’A = R’R. 


9.4 The construction of the QR decomposition via the Cholesky decomposition 
of the Gramian A’A is only seldom used in practical numerical work for the 
following two reasons (see also §16.5): 


e It is more expensive for n © m than algorithms which work directly on A. 


e The orthonormality of the factor Q is not explicitly built into the process, but 
is rather only the implicit result of the theory. Such an indirect approach is 
extremely susceptible to the influence of rounding errors. 

Exercise. Show that the number of floating point operations in the algorithm from Theo- 
rem 9.3 is (in leading order) 2mn? + n3/3. Discuss the cases of n < mand n ~ m. 
4Recall from §6.3 that such matrices are defined by Q’Q = I € K"™". 
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Modified Gram-—Schmidt 


9.5 For the direct calculation of the normalized QR decomposition of A from §9.3, 
we set Ay = A, Qi; = Q, Ri = R and partition stepwise according to? 


/ 
Ay = (bi | Be) =QeRer Qe = (4% | Qest), Re = = . . (9.1a) 


In the kth step, the column gq, of Q and the row (px, ri.) of Rx are calculated: 


partition calculate 


b kr Br 
(9.1a) ——’ (9.1b)+(9.1e) 
auxiliary quantities 


Pks Wks Te Ak+1 (k =1: n). 


The output A;,; thus provides the input for the next step. If we expand the single 
block row of Ay = Q;,R,, we obtain 
be = qKPkr Be = 9 + Qe r Reva, 
SS 
=Any 


which we solve for x, qx, Thr Agz1- From ||q4\|2 = (qi.9x)!/2 = 1 we get with p, > 0 


Px = WlOxll2, (9.1b) 
9k = DK/ Pk, (9.1¢) 


and from q/.Qx+1 = 0 it follows that 


/ Vg / V3 / 

Be = Tk Ve + Mp Qk+1 Reza = Vp 

Nae ae 
=i me 


so that finally 
r= 9, Br, (9.1d) 
Anu => By = kV: (9.1e) 


The algorithm (9.1a)-(9.1e) is called modified Gram—Schmidt (MGS). 


Remark. In 1907, Erhard Schmidt described a process for orthonormalization which he 
attributed, in a footnote, to the 1879 dissertation of the insurance mathematician Jorgen 
Pedersen Gram (who had however used determinants); the process was first referred to as 
Gram-Schmidt by the statistician Y. K. Wong in 1935. The slightly modified Gram—Schmidt 
process is absolutely preferable for practical numerical work and can essentially be found 
in the famous fundamental work of Pierre-Simon Laplace on probability theory (1816).76 


>We will thereby employ Q, and R,; as submatrices of the factors Q and R, whose unique existence we 
have already ensured in Theorem 9.3. In particular, Q; is column orthogonal and R, is an upper 
triangular matrix with positive diagonal so that qi.qx = 1, q,Qk41 = 0 and px > 0. 

26S, J. Leon, A. Bjrck, W. Gander: Gram-Schmidt orthogonalization: 100 years and more, Numer. Linear 
Algebra Appl. 20, 492-532, 2013. 
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Exercise. Observe Ayi1 = (I — 9).qx)B, and show: (1) if a; is the kth column of A then 


PRok = (1 9e-1%-1) °° (91m )ak (k= 1: 0). 
(2) The orthogonal projection P = I — qq’ with q'q = 1 satisfies Pq = 0 and Pu = u if q’u =0. 


9.6 The MGS-Algorithm (9.1a)-(9.1e) can be executed in situ by overwriting b, 
with q, and B, with Ax, so that finally the matrix Q is found in memory where 
the input A was previously stored. In MATLAB, the process can be coded as: 


Program |! (QR Decomposition with the MGS Algorithm). 


object access in MATLAB 


rh R(k,k+1:n) 
bk, 9k AC: ,k) 
By, Agy1 AC: ,k+1:n) 
R = zeros(n); 
2 for k=1:n 
R(k,k) = norm(A(: ,k) ,2); % (9.1b) 
NG ie) S ACs oie) Ce sik) 3 Ve (oa ie)) 
R(k,k+i:n) = AC: ,k)’*A(:,k+i:n); i (Owalel)) 
AC:,kt+i:n) = AC:,k+i:n) - AC:,k)*R(k,kt+1i:n); % (9.1e) 
end 


After execution of the program, the MATLAB variable A contains the factor Q. 


9.7 The computational cost for QR decomposition is dictated in leading order by 
the Level-2 BLAS operations in (9.1d) and (9.1e): 


# flop for the computation of r, = 2m(n — k), 
# flop for the computation of Ag ., = 2m(n —k). 


The total computational cost is therefore 
n 
# flop for QR decomposition with MGS = 4m Y° (n — k) = 2mn?. 
k=1 
Since A is read and subsequently overwritten with Q and also R has to be stored, 
input-output requires 2mn + n*/2 memory accesses. Hence, the efficiency ratio is 


mn n n<m, 
q = #flop/#iop = ——__,,, © 
amin ne 72 =m nem. 


Thanks to Level-3 BLAS, peak performance can be approximately reached for 
large n; a large m alone, however, would not be enough. 
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9.8 For the calculation of an (often not normalized) QR decomposition, MATLAB 
offers the interface 
[Q,R] = qr(A,0); 


to the LAPACK routines xGEQRF, xORGQR und xUNGQR. These do not use the MGS al- 
gorithms but rather the Householder method; details can be found in Appendix D. 
The factor Q can thus be much more accurate (more on this later) and the factor R 
is of similar quality. Nevertheless, this gained accuracy does come with a cost of 
# flop for QR decomposition with Householder = 4mn* — an 
which is a factor of 4/3 to 2 higher than the MGS algorithm. These costs can 
however be reduced by a factor of 2 when Q is not explicitly computed and instead 
an implicit representation is used to calculate matrix-vector products such as Qx 
or Q'y. The computational cost of the latter operations is then proportional to mn. 
Unfortunately, MATLAB does not provide an interface to these representations. 
Still, the factor Q is not needed for many applications, and the factor R can be 
directly calculated with the commands 


R = triu(qr(A)); R = R(i:n,1:n); 
for half the cost, i.e., 2mn? — 2n3/3 flop. 


Givens Rotations 


9.9 The normalized QR decomposition of a vector 0 4 x € K? is given by 


1 7 61/0 2 
aaa P, p= ([[xll2. 
————_—~ iW 


_ 
=x =41 


By extending q1 to a suitable orthonormal basis q, q2, we directly obtain the 
unitary matrix (check it!) 


if & —%& 
Q=(H | 2) =p & ¢ ; detO = 1, 
for which it then holds 
IlxIl2 ; IlxIl2 
x=O j Ox = : (9.2) 
0 0 


By multiplying with 0! we thereby “eliminate” the second component of x; such 
an ( is called Givens rotation induced by the vector x. Using 1 = I, we see that 
x = 0 is actually no exception and the elimination relationship (9.2) can be made 
to hold for all x € K?. 
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9.10 Using Givens rotations, a QR decomposition of a general matrix A € K”*” 
can be calculated without any condition on the column rank by column-wise “elimi- 
nating”, entry by entry, every component below the diagonal from the bottom up. 
Schematically this process takes on the following form: 


a ee , [eX * * 1 # of ff , {O # ¢ 
kk O* # ft fi o ¢ ¢ O « * O *« «* 
x * Ox Oo ¢ ¢ O * x O *« x O * * 
eo * * x * * xk * * * x * ok x 
i O * ie. O * * , Oo ¢¢ ¢ ie O * * ie O * x 
Belge «| Selo ¢ ¢|S5]0 0 4/25 ]0 0 «| S50 o + 
o# 4 0 0 ¢ 0 0 « 0 0 ¢ 0 0 0 
0 0 ¢ 0 0 x 0 0 x 0 0 0 0 0 O 


Here, the ‘*’ represent arbitrary elements. In every step, the two elements that 
make up a Givens rotation QO; are colored blue. This way, assuming a suitable 


dimension for I, we have 
; riya 
eee “a ( ~ lof’ 


as constructed in (9.2). In order to make it clear that Q; only has an affect on the two 
related rows, we label the elements that undergo a change during multiplication 
with a ‘4’. Finally, after s such steps, the product?” Q’ = Q!--- Qi lets us arrive at: 


2 
I 


Theorem. For A € K*" with m > n, there exists a unitary matrix Q € K'"*" and 
an upper triangular matrix R € K"*", such that?® 


R 
A=(Q|Q){>], A= 9.3) 
ee -_-_——4/ 
=Q 
The first relationship in (9.3) is called the full QR decomposition of A, while the second 
is called the reduced QR decomposition of A. 


Exercise. State a corresponding theorem for m <n. 


27In practical numerical work, one only stores the Givens rotations Q,; and does not explicitly calculate 
the matrix Q. The matrix vector products Qx and Q’y can then be evaluated in 6s flop by applying 
Qj and 0; to the appropriate components of x and y. 

?8In MATLAB (though it is based on the Householder method): [9,R] = qr (A); 
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9.11 The Givens process from §9.10 is especially effective when only a few elements 
need to be eliminated in order to reach a triangular form. 


Example. Only s = 4 Givens rotations are required for the calculation of the QR 
decomposition 


eee eS bet td 
* * & & *1  fO f f fg 

H=|0 « *« « *|/24J/o 04 4 ft] =R 
00 %* * « 000 ¢£ ¢ 
000% * 0000 


In general, matrices H € IK”*™ with such an occupancy structure are called upper 
Hessenberg matrices. As in §5.2 we characterize these matrices by 


AV, C Vea4 (k=1:m-—1). 


Their QR decomposition can be calculated with only s = m — 1 Givens rotations. 
(This will play a large role late during our discussion of eigenvalue problems). 


Exercise. How many floating point operations are needed for a QR decomposition of an 
upper Hessenberg matrix provided Q is not expanded but rather only the individual Givens 
rotations Q4,...,Q, are stored? Answer: #flop = 3m? for K = R. 


Exercise. We have introduced three standard form matrix factorizations: (a) triangular decom- 
position with pivoting: A = PLU; (b) Cholesky: A = LL’; (c) orthogonalization: A = QR. 


e Do there exist factorizations of the following form (pay attention to the underlying 
assumptions and dimensions)? 
(a)PUL _(b)UU'— (*) QL, RQ, LQ 
e Which of the variants can be reduced to its respective standard form? If applicable, 
write a short MATLAB program using the basic commands 1u, chol or qr. 


Hint: A diagonal splits a flat square into two congruent triangles. Which geometric transformations map 
one triangle onto the other? What does that mean for a matrix if considered a flat array of numbers? 


lil Error Analysis 


Most often, instability is caused not by the 
accumulation of millions of rounding errors, 
but by the insidious growth of just a few. 


(Nick Higham 1996) 


Competent error-analysts are extremely rare. 


(Velvel Kahan 1998) 


Contrary to a futile ideal of absolute, uncompromisingly precise calculation, 
the real world knows an abundance of unavoidable inaccuracies or perturbations, 
in short errors, which are generated, e.g., from the following sources: 


e modeling errors in expressing a scientific problem as a mathematical one; 
¢ measurement errors in input data or parameters; 

e rounding errors in the calculation process on a computer; 

e approximation errors in resolving limits by iteration and approximation. 


Despite the negative connotation of their names, such errors, inaccuracies and 
perturbations represent something fundamentally valuable: 


Errors are what make efficient numerical computing possible in the first place. 


There is a trade off between computational cost and accuracy: accuracy comes 
at a cost and we should therefore not require more accuracy than is needed or 
even technically possible. On the other hand, we have to learn to avoid unnecessary 
errors, and learn to classify and select algorithms. For this reason, it is important 
to become familiar with systematic error analysis early on. 


© Springer International Publishing AG 2018 39 
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10 Error Measures 


10.1 For a tuple of matrices (vectors, scalars), we can measure the perturbation 


perturbation ~ 


>T=T-4 E=(A,+£&,,...,At + Es) 


T = (A1,..., At) 
with the help of a basic set of norms”? by defining an error measure [E] as follows: 


e absolute error [Elabs = maxj=1:t ||E;|| 

e relative error*® [Ele = maxj—1:+ ||_E;\| /||Ajll 
How we split a data set into such a tuple is of course not uniquely defined, but 
rather depends on the identification of independently perturbed components. If a 


tuple consists of scalars (e.g., the components of a matrix or a vector), we refer to it 
as a componentwise error measure. 


10.2 For fixed A the error measure E ++ [E] is subject to the exact same rules as a 
norm; the only difference being that a relative error can also assume the value oo. 
A relative error can also be characterized as follows: 

[Elra<e << ||Ej| <e-|Ajl] G=1:4). 
It notably must follow from A; = 0 that E; = 0, that is unless [E]]+<1 = co. Hence, 


Componentwise relative error is for example taken into account when the occupancy 
structure (or sparsity structure) of a matrix is itself not subject to perturbations. 


10.3 The choice of a relative or absolute error measure can be made on a problem 
to problem basis, e.g.: 


e quantities with physical dimensions (time, distance, etc.): relative error; 
e rounding error in floating point arithmetic (cf. §12): relative error; 
e numbers with a fixed scale (probabilities, counts, etc.): absolute error. 


Yet sometimes, this decision is just pragmatically based on which of the concepts 
allows for a simpler mathematical result such as an estimate. 


°For a review of norms see Appendix C. We will limit ourselves to the norms from §C.9. 
3°Tn this context we agree upon 0/0 = 0 and €/0 = © fore > 0. 
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11 Conditioning of a Problem 


11.1 We will formalize a computational problem 


data | —> | math. model f | —> | result 


as the evaluation of a mapping*! f : x4 y = f(x). Here, there are no “actual”, 
precise values of the data that could claim any mathematical truth to themselves, 
but data (unlike, if any, parameters) are inherently subject to perturbations: 


perturbation of data | —>+ | math. model f | — | pertubation of result 


If such a perturbation of x on the input yields some %, by mathematical necessity, 
rather than y = f(x) we “only” obtain the result 7 = f(Z). 


11.2 The condition number describes the ratio of the perturbation of the result to 
the perturbation of the input. For a given problem f, at a data point x, using the 
measure [-]], the worst case of this ratio is—in the limit of very small perturbations— 


defined as / 
«(f;x) = limsup LA(®) — fC) 


Fox [x — x] 


At a data point x, a problem f is called 
e well-conditioned if x(f;x) ® 1; 
e ill-conditioned if x(f;x) >> 1; 
e ill-posed, if x(f;x) =. 


Where exactly do we draw the line for “« >> 1” (read: condition number far 
greater than 1)? The phenomenon of ill-conditioning quantitatively depends on 
the requirements and the accuracy structure of the application at hand. For the 
sake of definiteness, take a value of say x > 10° for the following passages. 


Remark. To put it as lucent as possible: whether a problem is ill-conditioned (ill-posed) or 
not depends only on the underlying mathematical model. The question does not, never 
ever, depend on the prospect that the problem might be, at some point, be tried to be solved 
using algorithms on a computer. The question of ill-conditioning (ill-posedness) is therefore 
completely independent of the realm of algorithms and computers. In the case of a highly sensitive 
problem, one has to carefully consider why a result should be calculated in the first place 
and to which perturbations it is actually subjected. 


31In this very general description, x represents a tuple of matrices as in §10.1. 
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11.3 To exemplify this concept, we use a “classic” of geometric linear algebra. 


Example. The conditioning of a problem can be nicely visualized when we attempt 
to find the intersection of two lines (that is, hereby, the data): 


——— 


The lines shown are only accurate within their width. The intersection on the left 
is essentially as accurate as the data; on the right, for two lines close to each other, 
the accuracy of the intersection is drastically reduced. This is called a glancing 
intersection. The problem on the left is therefore well-conditioned and the one on 
the right is ill-conditioned if put to the extreme. 


11.4 The definition of the condition number of a problem f at a data point x can 
alternatively be put as follows: take the smallest number x(f;x) > 0 such that 


[f(x+w) —f(x)] <«(f-x)-[w] +o([w]) — ({w] - 0). 
Here, the Landau-Symbol o(€) represents a term which converge superlinearly to- 
wards zero as € — 0: o(€)/e — 0. When we omit such additive, superlinearly 
decaying terms and only list the term of leading order relative to the perturbation, 
we can denote this briefly using the symbol ’<’ (and accordingly ‘>’ and ’=’): 


[f(x + w) — f(x)] < «(f;x) - [wv]. 


11.5 For perturbations w — 0, a differentiable map f : D C R™” — R has, by 
definition, the linearization 
f(x+w) = f(x) + fi(x)-w. 
Here, the derivative is the row vector composed of the partial derivatives 
f'(x) = (Of (x),---,Omf(x)) € KM. 


Through linearization, closed formulas for the condition number can be attained. 
The case of componentwise relative error can be completed as follows: 


[F(x+w)- fC) _ yo Fete) — f) 
[wl eae. FI Teo) 


— am cup VOL @ o UM)-wl w Lf Ce)) Le 
~ UP FOL Te] eee [fel dol LF 


K(f;x) = lim sup 
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Here, (a) holds, because numerator and denominator are absolutely homogeneous 
in w. By also proving (b) below, we obtain the following condition number formula: 


Theorem. For a differentiable f : D CR" — Rand f(x) 4 0 it holds 


(f;x) ey (11.1) 


with respect to componentwise relative error in x and relative error in f(x). This formula 
provides the smallest number x(f;x) > 0 such that 


[f(x + w) — f(x) 
If(x)| 


We call the value x(f; x) the componentwise relative condition number of f at x. 


<«(f;x)e where |w| < e|x| ase > 0. 


Proof. Set y’ = f’(x). The perturbation w of x satisfies, with respect to componen- 
twise relative error, [w] = ||wx|loo, where wy € IR” represents the componentwise 
division of w by x; the componentwise product of y and x will be denoted by 
*y € IR”. Hence, in accordance with (C.3), 

*y! | 


I[2|le0 


sup ly! a = sup Py! ws “ws = sup 
[w] 40 [~] wxA0 || x|leo 040 


= [I*y'lleo = [IFylla = Ly'l - lz. 
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whereby (b) is proven. 


11.6 With formula (11.1) at hand, we can directly ascertain the componentwise 
relative condition number « of the elementary arithmetic operations f: 


flare Mi-& ab a/e @ Va 


fal+ eo) 
1 eet 2 1d 472 


for 61,2 >0. 
K 


We will walk through the calculations for addition and subtraction, the remaining 
values are left as an exercise. When x = (Cj )j=1:2 and f(x) = ¢ + é2 it holds 


44). ay ; Ay 
Sey cI (let) - (et) _ pei | 
LF(x)| 1 = Ga lor + 22 [or + Cal 
In the case of addition, this can be reduced for ¢,¢2 > 0 to x = 1. With the notable 
exception of genuine subtraction, all elementary operations are well conditioned 


with respect to componentwise relative error. Genuine subtraction, however, is 
ill-conditioned in the case of cancellation, that is, in the case 


|i + C2] & |e1| + [Cal, (11.2) 


and it is even ill-posed for ¢; + ¢2 = 0. 


32Notice that the zero components of x in this example are not a problem (why?) 


44 Error Analysis [ Ch. Il 


Example. Let us subtract the numbers 
@ = 1.23456 89?- 10° 
@> = 1.23456 78?- 10°, 
where the ’?’ represents uncertainty in the 9th decimal place. This yields the result 
& — & = 0.0000011?- 10° = 1.1?-10-®, 


in which the uncertainty has been moved to the 3rd decimal place of the normalized 
representation (where the leading position is nonzero): We therefore have lost the 
significance of 6 decimal places. In this case, the condition number of subtraction is 


K Fe 2.2-10°. 
Quite generally it can be stated that: 


Compared to the data, a result loses about log,, x significant digits in accu- 
racy (with x the componentwise relative condition number of the problem). 


Exercise. Show the componentwise relative condition number of the inner product x! y to be 
. ly | m 
os (x,y € R”). (11.3) 


Characterize the ill-conditioned/ill-posed case and compare with §11.7. 


Conditioning of Matrix Products and of Linear Systems of Equations 


11.7 Let us examine the matrix product AB for A € K”*", B € K”*? subjected 
to perturbations with a small relative error in the form of 


A=A+E, |lEll<ellAl, B=B+F, Fil <ellBl- 
It then holds, given the triangular inequality and submultiplicativity, that 
|| AB — AB|| < |[E|] - [Bl] + |All - Fil + EI - Fi] < 2el] All - WBIL, 
——" ~~’ 
<elAl|-|B] — <elJAl]- BI] <e? || Al-11B || 
so that the relative error of the perturbed result satisfies the following estimate: 


AB— AB|| . . |All - |B 
tl le IAI IBl 
|| AB|| |A - Bl| 


Here, if only one of the two factors A, B, is perturbed, we can omit the factor 2. 


Remark. The relative condition number x of the matrix product AB therefore fulfills the 
ee All: 1BI 
“|B 
K < 2———__.. 
IA - B| 
In specific cases, such upper bounds provide either proof of a good condition number or 
otherwise hint towards a poor condition number. 
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Exercise. For componentwise perturbations of the form 


A=A+E, |E|\<elAl, 


B=B+F, |E|<e\Bl, 


show the following estimates (the factor 2 is omitted when only one factor is perturbed) 


|AB — AB| < 2e|A| - |BI, 


|AB- ABll.o 4 


IAT = |Bllheo 


(11.4) 


|| AB||oo ||A = Blloo 


11.8 For a linear system of equations Ax = b we distinguish two cases: 


Perturbation of b € K" with a relative 
error of the form 


b=b+r, — |I[rll <ellbll, 


leads to AX = b +1, so that 

¥—-x=Aé'y. 

With the induced matrix norm it holds 
|f— xl] < ATT Ir 


< A" -IAll- Ilell e 


The relative error of the result there- 
fore fulfills the estimate 


Perturbation of A € GL(m;K) witha 
relative error of the form 


A=A+E, — |El| <el/All, 
leads to (A + E)% = b, so that 
x—-%=A'EX. 


With an inducing vector norm it holds 


lx — I < JA TEM - Wel 


< ||A~*}] -[lAll- lL 


The relative error of the result (here 
in reference to £) therefore fulfills 


I|x = $I] 


<K(A)je. (11.5) 


Here, we have defined 
K(A) = ||A~*|- |All 


as the condition number of A; for a non-invertible A € K"™*™ we set x(A) = ov. 


Remark. In both cases, we see in the limit € — 0 of small perturbations that the relative 
condition number « of the linear system of equations is bounded by « < «(A). Actually, 
equality holds true if just the matrix A is perturbed (cf. §11.9 as well as (16.4)). 


Exercise. Show that given a componentwise perturbation of Ax = b where 
b=b+r, |r| <elb, or A=A+E, |E|<eAl, 
the relative error of the perturbed result % is bounded by 


LAHAT |x Ileo. 
[Ilo 


I|¥ = xlhoo - 


< cond(A,x) €, cond(A,x) = 
I|X|]o0 
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This Skeel—Bauer condition number cond(A, x) of a linear system of equations further satisfies 
cond(A, x) < cond(A) = |||A~"| |A] |leo < te0(A) = || A |loo|| Aloo: 


Construct a matrix A where ..(A) >> 1 that nevertheless satisfies cond(A) ~ 1. 


11.9 Geometrically speaking, the condition number of a matrix A describes the 
distance to the nearest non-invertible (short: singular) matrix: 
Theorem (Kahan 1966). For the condition number x(A) belonging to an induced matrix 
norm it holds that 

I| El 


x(A)~! = min { 4 :A+E is singular \ ; 


This is consistent with the extension of the definition to x(A) = ©¢ for a singular A. 
Proof. If A+ E is singular, there is an x 4 0 where (A + E)x = 0, which implies 
: = ||| 
0 < [xl] = |[A*Ex|| < ATI MEI [ell = MA) Tay lll 


and therefore via division x(A)~! < ||E||/|| Al. 

In a second step, we must construct an E such that equality holds (for the sake 
of simplicity, we will restrict ourselves to the || - ||>-norm). To this end, we choose 
a vector y normalized to ||y||2 = 1, for which the maximum is taking in 

|AW*l]2 = max ||A~*ul|2 = ||A~*yll2. 


Il [l2=1 
We set x = A~!y A 0 and select the perturbation of A as the outer product 


go 
x'x 


The spectral norm can be calculated as (we will leave step (a) as an exercise) 


@ llyllalitl,2 1 I|Ella 
llxll3 IIxll2_ |All [Alle 


But now A + E is singular, because (A + E)x = y—y =Oand x £0. 


I|Ell2 = K(A)71. 


Exercise. Carry out the second half of the proof for the matrix norms || - ||; and || - |loo- 


11.10 If a matrix A is subjected to an uncertainty of relative error € while having 
a large condition number x2(A) > e~!, it then could actually “represent” some 
singular matrix A as provided by Kahan’s theorem (think of an uncertainty as a 
kind of blur that makes things indistinguishable). Hence, we define: 


Matrices A with x2(A) > e~! are called e-singular.* 


33 Anticipating the discussion in §12.6 we put on record that e-singular matrices are called numerically 
singular if € < Emach, Where €mach is the machine precision to be defined in Lemma 12.2. 
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12 Machine Numbers 


12.1 A t-digit floating point number ¢ # 0 of base 6 € N35» is written as 


C= +dy.dz---d; x B° with digits dp € {0,1,...,B —1}; 
a 


Mantissa 


where the exponent e € Z is normalized such that the leading digit satisfies d, 4 0.54 
Alternatively, we can also represent such ¢ uniquely in the form 


G=tm-B tt! me {pl pe t+1,...,p'-1} CN, eeZ. 


We will call the set of all such floating point numbers, together with zero, machine 
numbers F = Fp}. 


12.2 The operation of rounding fl: IR — F maps ¢ € R to the nearest machine 
number fl(@) = @ € F.* Rounding is obviously 


e monotone € <y = fl(G) < f(y); 
e idempotent fl(¢) =é for ¢ € F. 


Lemma. The relative error of rounding fl(€) = é is bounded by the so-called machine 
precision™ Emach = 4p, 


eae 
I¢| 


¢ 
Equivalently, € = €(1 +) where |e| < €mach- 


| < Emach- 


Proof. Without loss of generality let ¢ > 0 which implies that 0 < ¢9 < € < ¢, for 
two consecutive machine numbers ¢9,¢1 € F. Then 


CE {20,61} with |¢—¢| < (C1 — G0) /2 
and from the representation @, = (m+ k)B°t!~ with B'~! < m < B! it follows 


lé ¢| 1 C1 Co 1 1 1-t 
< = < B =€E P 
Z| 2 E 2 2 mach 


The second statement is only a rewording of the first with e = (é — €)/é. 


34When B = 2, the leading digit is always d, = 1. As hidden bit it does not need to be stored. 

35]f there is a tie between two possibilities, we choose the one with an even d;. Being statistically 
unbiased, this rounding mode is generally preferred and is called round-to-even. 

36Tn the literature this quantity is also called machine epsilon or unit roundoff, often denoted by uw. 
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12.3 On a computer, the exponent e is restricted to a finite set 
{emin, €min + 1,...,€max — 1, max } CZ. 


Numbers é € F with an absolute value that is too large or small can therefore no 
longer be represented and can lead to exponent overflow or underflow respectively. 
As a default, computers “catch” these limiting cases with assigning the values too 
and zero without any notice. By applying some care, over- and underflow do not 
pose all too great a threat in numerical linear algebra (except for determinants). 


12.4 Essentially every computer since 1985 implements the IEEE 754 standard, 
which offers two binary formats of hardware arithmetic:°” 


single precision double precision 


32 bits = 4 bytes with the storage scheme | 64 bits = 8 bytes with the storage scheme 


s:1] @:8 f:23 s:1} e:11 f:52 
(-1)82°°?? x Lf e@=1:254 (—1)8 22-103 x 1. f =e = 1: 2046 
g (-1)82-P’ xOf e=0 = (-1)§2- x0f e=0 
(—1)§ 00 e = 255, f =0 (—1)$ 00 e = 2047, f =0 
NaN e=255,f £0 NaN e = 2047, f £0 
base p: 2 base p: 2 
mantissa length t: 24 mantissa length t: 53 
overflow threshold: 2!27(2 — 2-73) overflow threshold: 21973(2 — 2-5?) 
3.4 x 10°8 ® 1.8 x 10°08 
underflow threshold: 2-176 ~ 1.2 x 10-38 underflow threshold: 2-1? ~ 2.2 x 10-308 
precision €mach! 2-74 = 5.96 x 10-8 precision €mach: 2-53 = 1.11 x 10716 


corresponds to approximately 8 decimal places | corresponds to approximately 16 decimal places 


IEEE 754 also recognizes the symbols 


e +00, e.g., as the value of +1/0; 
e NaN not a number, e.g., as the result of 0/0 or co — oo. 
Both alleviate and standardize handling of arithmetic exceptions. 
12.5 IEEE 754 furthermore defines that arithmetic operations and the square root 


are computed for machine numbers by correctly rounding the exact result: if we 
denote the realization of an operation * by * we get for ¢,7 € F that 


Gey =A xy) (KE {t+ —1/,/}). 


Hence, Lemma 12.2 implies the following standard model for machine numbers: 


37By default, MATLAB uses double precision, but also allows the use of single precision. 
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For €,y € Fand x € {+,-,-,/, J} there is an \€| < Emach, such that 


gy = (Exy)(1 +6) (12.1) 


for the machine-realization % of the operation x. 


Remark. Both the associative law and the distributive law loose there validity in machine 
arithmetic: parentheses should therefore be placed very deliberately (since otherwise 
“dramatic” consequences could follow as we will see in §14.4). 


Exercise. In order to realize complex arithmetic with real operations, how must €mach be 
adjusted for the standard model to remain valid in K = C? 


12.6 The first rounding errors can occur as early as during the input of data into 
the computer. This is, for example, because the simplest decimal fractions are 
already be represented by non-terminating binary fractions, such as 


0.149 = 0.00011002, 


which therefore must be rounded to a machine number. In the case of data x (this 
can be a tuple of matrices) and a problem x ++ f(x) as in §11.1, this causes an 
input of machine numbers £ = fl(x) with a componentwise relative error 


|< — x| < Ema 
If we first apply a monotone norm and then the equivalences from §C.9, this means 
that with respect to every relative error measure there is an input perturbation of 
the form?’ 

[% _ xX] rel = O(€mach): 
This kind of input perturbation leads to unavoidable perturbations of the result; 
even when the computer calculations after data input were exact. With an appro- 
priate condition number, §11.4 provides the estimate 


Lf(£) — F(x)] = OG; x) €mach):- 


38The Landau symbol O(€mach) stands for a bound of the form 


|O(€mach)| < C€mach (€mach < €0) 


with some constant c > 0. Here, for statements concerning problem classes and algorithms, let us 
agree that c does not depend on any specific data or the machine precision €mach, but may well 
polynomially depend on the dimensions of the problem at hand. 

When assessing concrete instances (of the problem and €ach at hand), we will accept such 
constants c (as well as ratios c = a/b for comparisons of the form a % b) which, for example, allow 
for a maximum loss of accuracy of one third of the mantissa length: 


—1/3 


cx Emach* 


When working with such assessments, always remember: “your mileage may vary”—the principle, 
however, should have become clear. 
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13 Stability of an Algorithm 


13.1 An algorithm for the evaluation of f is ultimately a decomposition 
f= froroft 
e—S>>-— 
algorithm 


into sequentially executed computational steps such as elementary arithmetic op- 
erations, Level-n BLAS or other standardized operations. However, when compared 
with the map f, on the computer the use of machine numbers leads to a perturbed 


map”? x & és 
fsfRer seth, 
whereby fi represents the execution of the standardized operations fj under 


consideration of all rounding errors involved. 


13.2 An algorithm ‘i for the evaluation of the problem f is called stable, if 


Lf (x) — f (¥)] — O(€mach) 


for suitably perturbed inputs % (depending on the actual input x) within the scope 
of machine precision 


[*-x] = O(€mach) 


otherwise it is unstable. According to Trefethen and Bau, we can formulate this as: 


A stable algorithm gives nearly the right answer to nearly the right question. 


Remark. A stable algorithm for the problem f therefore behaves completely comparably to 
the sequence flo f 0 fl, which corresponds to the minimum amount of rounding needed. 


13.3 Many algorithms in numerical linear algebra fulfill a concept, that is both 
stronger and simpler than stability. An algorithm 7 for the evaluation of the 
problem f is called backward stable if actually 


f(x) = f(&) for a perturbed input ¥ with [% — x] = O(€mach): 


We call such % — x the backward error of the algorithm f ; comparing the backward 
error with the machine precision is called backward analysis. According to Trefethen 
and Bau we can formulate this all as: 


A backward stable algorithm gives exactly the right answer to nearly the 
right question. 


3°Due to the dependence on the considered algorithm, there is no “absolute”, unique f. We therefore 
always identify f as the realization of a concrete algorithm using a given machine arithmetic. 
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Example. The standard model (12.1) of machine arithmetic implies that the arith- 
metic operations are realized as backward stable. For ¢,4 € F it holds actually 


Ging=¢+9 
Cg =C-y 
e/y=é/y 


where é = (1+ ¢), and 7 = 7(1+€), for suitably chosen |e| < €mach- 


Remark. When correctly rounded, F is actually exact in the case of cancellation (11.2). 


Exercise. Show that the square root is realized as backward stable in the standard model. 


13.4 Now we ask, how accurate is the result of a stable algorithm? Since we can 
estimate the accuracy of f() according to §11.4, backward stability yields directly 
the error estimate*® 


Lf (x) — f(x)] = O((f;*) emach)s 
we call f(x) — f(x) the forward error of the algorithm f. 
It would be futile to require a greater accuracy from an algorithm since, according 


to §12.6 the input into the computer of x alone creates a perturbed ~ whose exact 
result were subject to the same error estimate: 


Lf(£) — F(x)] = OCF; x) €mach): 


Comparing the forward error of an algorithm with the unavoidable error deter- 
mined by the conditioning of the problem is called forward analysis. 


Stability Analysis of Matrix Products‘! 


13.5 The inner product 7 = y’ x of two machine vectors x, y € F” can be realized 
recursively with the following simple algorithm: 


7 = 0, fe = Tie + pe Ge (R= Den). 
According to the standard model, we write this in the form 
ft = (Fea + (Me &k)(1+ex)) (145) (k= 1m) 


with the relative errors |e,|,|5¢| < €mach (where 6, = 0). If we collect all of these 
errors as an backward error of the components of y, the expanded result is 


im = ri “Xx, lj —y| < M Emach|¥\, 


The same result holds for stable algorithms, if x(f;x)~! = O(1) (which is usually the case). 
“Notation as in §§2.8 and 2.10 
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where ff, = yx (1+ ex) (1 + dg) +++ (1+ 6m) = 9¢(1 +O), So that |O,| < memach-” 
Since this all is valid independent of the order of the terms, we keep for the record: 


Conclusion. Level-1-BLAS computes backward stable inner products. 


13.6 The matrix vector product y = Ax can be realized row-wise as the inner 
product of x € F” with the row vectors of the matrix A € F”*”. If we collect 
the errors as the backward error of the row vectors, the following backward error 
estimate holds for the machine result 7 € F” 


G=(A+E)x, — |E| <1€machlAl. 


Conclusion. Level-2-BLAS computes backward stable matrix vector products. 


13.7 For C = AB with A € F™*", B € F”"*? we obtain column-wise 


d= (A+E))v, |E;| < 1 EmachlA| (j=1: p). 


The column vectors ¢! are therefore computed backward stably. In general, this 
does not apply to the computed product matrix C itself. 


Example. For an outer product C = xy’, the value Cis generally no longer a rank 1 matrix. 


Fortunately enough, we obtain this way an estimate of the forward error that 
lies within the order of the unavoidable error, cf. (11.4): 


JC — C| < n€mach|Al - [BI- 


If A > 0, B > 0 componentwise, it holds |A| - |B| = |C| and the result C behaves 
as the input of C subject to a machine precision of N€mach- 


Simple Criteria for the Analysis of Numerical Stability 


13.8 Comprehensive error analysis as presented in the prior pages can quickly 
become tedious. In order to spot the weak points of an algorithm that would 
seriously endanger stability, we consider the error transport within an algorithm 
sectionwise by writing 


fa oe ee oT Bog 
VY «- \ 
=h =8 
we call h an end section and g a start section of the algorithm. The total error of f 


is—either considered as a forward or as a backward error—in leading order the 
superposition of the contributions created by just rounding between such sections: 


fe =ho flog. 


“Due to 5; = 0 there is a maximum of m factors of the form (1+ «) with |a| < emach to be found in 
the expression 1, . 
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If these contributions do not “miraculously” cancel each other out (which one 
should not count on), any large individual contribution will already be responsibly 
for instability. In this sectionwise analysis, the intermediate rounding 2 = fl(g(x)) 
in f, contributes 


e to the forward error by 
Lf (x) — f(x) = O(«(h; 8(X)) €mach), 
e to the backward error (for invertible g) by 


[x+ — x] = O(x(g~':g(x)) Emach) 
where the data x, = g~!(2) is reconstructed from the intermediate result 2. 
Forward analysis (F) and backward analysis (B) directly provide the following: 


Instability Criteria. Important indicators for the instability of an algorithm with a 
section decomposition f = ho g are 


F: an ill-conditioned end section h with x(h; g(x)) >> «(f;x); 


B: an inversely ill-conditioned start section g with x(g~!;¢(x)) > 1. 


Remark. Actual evidence of instability must always be based on a concrete numerical example. 


Exercise. Show that for scalar start and end sections g and h Criteria F and B are equivalent 
with respect to componentwise relative errors; it then actually holds 


k(etse(xy) — 8) 
(g “38(x)) oe (13.1) 


13.9 By definition of the condition number, an algorithm f given by the decom- 
position 


f=fso---ofoofi 


leads directly to a multiplicative upper bound of the condition number of f, i.e., 


wf; 2) SM fol fea = )) >= RCP fx) ) RU), 


which is generally a severe over-estimation: it represents the worst case error ampli- 
fication of every single step but these critical cases do not necessarily match one 
another. The estimate is therefore neither suitable for estimating the condition of 
the problem f, nor for judging the stability of the algorithm f . One exception is 
notably where effectively everything is benign: 


When a,b > 0, a > b is equivalent to a/b > 1. By doing so, we consistently choose one and the 
same “pain threshold” as our interpretation of “very large”. 
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Stability Criterion for “Short” Algorithms. If it simultaneously holds that 
(1) all fi are stable, 
(2) all f; are well-conditioned, 
(3) the number of steps s is small, 

then f = f,0---0 fy isa stable algorithm and f a well-conditioned problem. 


Example. Conditions (1) and (2) are satisfied with respect to componentwise relative 
errors if the fj consist exclusively of the benign operations of the standard model, 
that is, if they do not contain a genuine subtraction.* If 


S = Sadd + Ssub + Smult + Sdiv + Squad 


represents the decomposition into the number of genuine additions, genuine sub- 
tractions, multiplications, divisions and square roots, then, according to §11.6, the 
following condition number estimate for the problem f is valid:® 


Ssub =O => K(f; x) < 2Smult*Saiv (13.2) 


Genuine subtractions could be problematic in the case of cancellation; one should 
either avoid them altogether or carefully justify them with an expert analysis. 


Exercise. Compare (13.2) with (11.3) for x'y if x,y € IR™ have positive components. 


14 Three Exemplary Error Analyses 


In this section we will restrict ourselves to componentwise relative errors. 


Error Analysis |: Quadratic Equation 


14.1 In junior high school, one learns the quadratic equation x” — 2px — q = 0 
and its textbook solution formula (which is an algorithm if taken literally) 


xo =p—\/P+4, m=ptypP tg. 


In order to avoid a tedious case-by-case study, we will limit ourselves to p,q > 0. 
The error analysis can be done by a mere “clever inspection”, i.e., basically without 
any calculations that would go beyond simple counting: 


e According to the stability criterion for short algorithms, the formula for x1 
is stable; additionally, (13.2) proves well-conditioning: «(x1; p,q) < 2. 


“We identify € + 7 for |@ + y| = |€| + || as genuine addition, otherwise as genuine subtraction. 
45For square roots we have to set x = 1 instead of x = 1/2, since generally data and intermediate 
results remain stored during execution and their perturbations remain untouched. 
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e The formula for x9 exhibits the section decomposition 


f: (pq) (r.\/p +4) = (pr) > p-r=%. 


For q < p*, the subtraction in the end section / is afflicted with cancellation 
and thus ill-conditioned (cf. §11.6): therefore, by Criterion F, the formula is 
presumably at risk of being unstable (more precise information can only be 
acquired via conditioning analysis of f : (p,q) > x0). 


Example. A numerical example in MATLAB (using double precision) is illustrating 


this very clearly: 
1 >> p = 400000; q=1.234567890123456; 
22> © = sqrt{p 2+q); 20 = p = xr 
> xO = 


-1.543201506137848e-06 


Here, actually 11 = 6 +5 decimal places are lost by cancellation, so that the computed 
solution x0 only contains at most approximately 5 = 16 — 11 correct decimals places.4¢ 
We still however have to clarify whether to blame the algorithm for this loss of 
accuracy or, after all, the ill-conditioning of f : (p,q) + Xo. 


e The inverse start section g~! : (p,r) ++ (p,q) is also subject to cancellation 
forqg < ce in particular the reconstruction of the coefficient 
2 2 
q=r > 7. 
Here, accurate information about q is lost through Fae By Criterion B, the 
formula for xp is therefore at risk of being unstable. 


Example. In the numerical example, we expect the reconstruction of q to exhibit a 
loss (that is, a backward error) of 11 = 2-6 — 1 decimal places, i.e., once again we 
get only approximately 5 = 16 — 11 correct decimal places (which upon comparison 
with q can instantly be observed): 

5 Beare, 

6 alia 

7 1.234558105468750e+00 


Since the backward and forward errors are both of the same magnitude (a loss of 11 
decimal places), the map f : (p,q) + x9 should be well-conditioned. 


*6The further 11 decimal places of “numerical garbage” arise through the conversion of x0 from a 
binary number in where zeros have been appended to the end of the mantissa (here: concretely 38 
bits): the MATLAB command num2hex (x0) reveals beb9e40000000000. 

47Since this loss of information in q does not affect the stability of the formula for x1, it must be the case 
that for q < p?, the value of x; is quite immune to a perturbation of q: in fact, one calculates that 
K(x13qg) = (1— p/p? +q)/2 «1. 
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14.2 A stable algorithm for x9 must therefore (a) eventually get along without 
subtraction, and (b) use the information in the coefficients q in a much more direct 
fashion. The factorization x? — 2px — q = (x — x9)(x — x) ultimately provides us 
with a further formula for x9, which can do both (called Vieta’s formula): 


xy =—q/e1,  H1 =p+ r+. 


As with the formula for x1, it is stable according to the stability criterion for short 
algorithms; (13.2) now proves (p,q) ++ x9 to be well-conditioned: (x0; p,q) < 4. 


Example. In the numerical example, this looks as follows: 


8 pt. 
9 >> x0 = -q/x1 
10 xO = 


-1.543209862651343e-06 


At this point all 16 shown decimal places should be (and are) correct (due to the stability 
of the formula and the well-conditioning of the problem); a comparison with the results 
of the “textbook formula” for xg in §14.1 now verifies that, as predicted, only 5 decimal 
places of the latter were correct: 


In the case of cancellation, q < p, the “textbook formula” for xg is unstable. 
Remark. A detailed conditioning analysis of both solution maps (p,q) ++ x9 and (p,q) H+ x4 


is actually no longer necessary. The condition formula (11.1) yields 


1 3p 1 p 
K(x9; p,q) = = + ———— <2, K(x1;p,9) = = + ————_ << 1 
(x07 Pq) = 5 a rar (x17?) = 5 i are 
our expert analysis “by inspection” overestimated the upper bounds by just a factor of 2. 


Exercise. When q < 0 and p € R, discuss the cancellation case 0 < p?+q < p* + |q]. 


Error Analysis 2: Evaluation of log(1 + x) 


14.3 Ina safe distance to the singularity at x = —1, the function f(x) = log(1+ x) 
is well-conditioned: 
_ [fF Ix] x 


"EO = “Tea, = T¥x)logtitay <2 & 7-97). 


Let us take the expression log(1+ x) literally as an algorithm by decomposing it to 


fixr®o1 x= wr logw. 


This exposes the risk of instability for x ~ 0 in accordance with both instability 
criteria, Criterion F and B:*8 


48Recall that they are equivalent for scalar functions, cf. (13.1). 


Noe 
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F: In the vicinity of its root w = 1, h(w) = log(w) is ill-conditioned.*” 


B: When w ~ 1, the expression ¢~!(w) = w — 1 experiences cancellation, and 
is therefore ill-conditioned (information about x ~ 0 gets lost in w = 1+ x.) 


Example. A numerical example exhibits the instability: 


>> x = 1.234567890123456e-10; 
>> w = 1+x; f = log(w) 


3; f= 


1.23456800330697e-10 


5 >> w-1 
6 ans = 


1.23456800338317e-10 


For w — 1, a total of 10 = 1+ 9 digits are canceled out, leaving only 6 = 16 — 10 decimals 
places correct (backward error). Due to x(f;x) * 1, only approximately 6 decimals places 
in f are expected to be correct (forward error). 


14.4 A stable algorithm for f must therefore (a) process the intermediate result w 
in a much better conditioned fashion and (b) use the information in x in a more 
direct way. To this end, Velvel Kahan came up with the following “twisted” idea: 


ee 4 wl, 
w=14x, f(x) = w—1 
x w=1 


The trick is that the section ¢ : w ++ log(w)/(w — 1) is actually consistently 
well-conditioned (and the 1 is not subject to rounding errors on the computer): 


le’ (w)||wl — 1-w+wlogw 
(30) = Tew). 7 to Tlogw 


The cancellation in the denominator w — 1 of ¢ is therefore canceled away by the 
perfectly correlated imprecision of log w in the numerator: 


<1 (w > 0). 


In a given algorithm, imprecise intermediate results are allowed every time 
their errors are subsequently compensated for. 


According to the stability criterion for short algorithms, Kahan’s idea is stable. 


Example. Using the numerical values from above, Kahan’s idea yields: 


8 >> £ = log (w)/(w-1) *x 


1.23456789004725e-10 


+ As is the case with functions in the vicinity of their roots in general if we use relative errors. 
59s long as the library routine for log w is stable. 
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In this case, (due to the stability of Kahan’s algorithm and the well-conditioning of the 
problem) almost all 16 shown decimal places should be correct (they all are). A comparison 
with the result of the “naive” expression in §14.3 verifies that, as predicted, only 6 decimal 
places of the latter were correct.5! 


Remark. Kahan’s algorithm serves as a “dramatic” example of the invalidity of the asso- 
ciative law in machine arithmetic: (1 + x) — 1 4 x for 0 = x € F. It would therefore be a 
profound blunder to “simplify” a program such as 


log (1+x) *x/((1+x)-1) 
to log(1+x). Unfortunately, there are programming languages (e.g., Java) that offer such 
bizarre “performance features”; Kahan fought for years against this very thing.> 
Error Analysis 3: Sample Variance 


14.5 Descriptive statistics offer two rival formulas (that is, algorithms if taken 
literally) for the sample variance S? of a real data set x = (x1,---,X%m)! € IR": 


2 m 
2 (a) 1 m -\2 (b) 1 m 2 1 m a 
S ~ aT GI *) | Ln ee B= Dx 


j=l 


Formula (a) requires two passes over the data: one to initially calculate the sample 
mean X, and then a further pass to accumulate the sum Lilx; — x)*. If the sample 
variance is to be calculated while new data are being generated, many statistics 
textbooks will recommend to alternatively use formula (b): in this case, just a 
single pass is necessary, whereby both sums )); x; and )/ x7 are accumulated. 


Unfortunately, formula (b) is numerically unstable, while formula (a) is stable:°3 


e The end section of formula (b) is executing (aside from the well-conditioned 
and stable, and thus harmless division by m — 1) a subtraction. In the case 
of cancellation, i.e., the case of a relatively small variance 


Pex, 
this subtraction is actually ill-conditioned. Hence, according to Criterion F, 
formula (b) is presumably at risk of being unstable. 
Example. For an actual demonstration of instability, numerical examples can easily be 
found in which formula (b) returns a ridiculous negative result in machine arithmetic: 


1 >> x = [10000000.0; 10000000.1; 10000000.2]; m = length(x); 
2 >> $2 = (sum(x. 2) =sum(x)~2/m)/ (m-1) 
3 2 

-3.125000000000000e-02 


51MATLAB offers Kahan’s algorithm for log(1 + x) using the command logip(x). 

52W. Kahan, J.D. Darcy: How Java's Floating-Point Hurts Everyone Everywhere, UC Berkeley, 1998-2004. 

°3For a stable single-pass algorithm see T. F. Chan, G. H. Golub, R.J. LeVeque: Algorithms for computing 
the sample-variance: analysis and recommendations. Amer. Statist. 37, 242-247, 1983. 


Sec. 14] Three Exemplary Error Analyses 59 


The cancellation of 17 = 2-8 +1 > 16 decimal places explains how the result could 
become such a “total loss”. 


e In contrast, formula (a) contains the (genuine) subtractions in its start section, 
B: (M1)... Xm) A (X41 — ¥,..-, Xm — ¥) = (61,.--, 6m). 
Here, the reconstruction of data by 
gs By. Om) 2 Oy +E. Oe tS) 


is the decisive factor for Criterion B. Since it is well-conditioned for small 
fluctuations |6;| = |x; — x] < |x|, the start section does not pose a threat to 
the stability. As a sum over m non-negative terms, the end section 


m 
ee eae 2 
J= 


is well-conditioned and is easily coded on the computer in a backward stable 
fashion. Formula (a) is numerically stable, indeed. 


Example. In the numerical example, we get the following: 


5 >> xbar = mean(x); 

5 >> $2 = sum((x=xbar).~2) /(m=1) 
7 §2 = 

8 9.999999925494194e-03 


A quick mental calculation provides S2 = 0.01 so long as we regard the data set x as 
exact decimal fractions. The loss of 8 decimal places in our stably calculated solution 
must therefore be due to an ill-conditioning of about «(S?; x) ~ 10°. 

Without knowledge of the “exact” result (which, of course, we would never know 
when doing serious computations), an evaluation of the accuracy of a stably calculated 
result requires at least a rough estimate of the conditioning of the problem. 


14.6 The componentwise relative condition number x of the sample variance S2 
relative to the data set x € IR” can be directly calculated with (11.1) as 
2 e 2||xll2 
K(S7) x) = |x; — &| |x;| < —— =. 
(m —1)S? dX A M$ /m—1 


Example. Hence, the numerical example confirms finally that x ~ 2 - 10°: 


9 >> kappa = 2*abs((x-xbar)) ’*abs(x)/S2/(m-1) 


11 


) kappa = 


2.0000e+08 


In this case, the loss of approximately 8 decimal places was therefore unavoidable. 


Remark. Notice that only the order of magnitude of a condition number is relevant information. 


54The upper bound is found with the help of Cauchy-Schwarz inequality (T. F. Chan, J.G. Lewis 1978). 
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15 Error Analysis of Linear Systems of Equations 


In this section, A € GL(m;IK) and b € K" and error measures are generally normwise relative. 


A posteriori Assessment of Approximate Solutions 


15.1 We would like to asses the error of a given approximate solution ¥ € K” of 
a linear system of equations Ax = b. There are two options to look at, forward 
and backward error, which differ significantly in their accessibility: 


e Without knowledge of an exact solution x, the forward error ||% — x||/||x|| can 
only be estimated and an assessment must be made in comparison with the 
unavoidable error in terms of the condition number «(A). 


e The backward error is defined as the smallest perturbation of the matrix A 
which makes * exact: 


w(X) = min { al (A+ E)¥= v}; 


as seen next, it can be computed without resorting to an exact solution and 
can be assessed by direct comparison with the accuracy of the data. 


15.2 There is actually an explicit formula for the backward error. 


Theorem (Rigal-Gaches 1967). The backward error w(X) of & # 0 is 


(lel ; 
w(x) = ay r=b— AX. (15.1) 
®) = Tana 


Here, r is called the residual of the approximate solution X of Ax = b. 
Proof. °° From (A + E)% = b, hence EX = 1, it follows that 


r IIr| [Ell 
IIrll < HEMEL, <7. 
AT Ell ~ (Atl 


We must now construct E such that equality holds in this estimate. For the sake of 


simplicity, we limit ourselves to the || - ||2-norm: 
=/ ~ 
— bs with |Ell2 = lalla 
ae Il llo 


satisfies E¥ = r and ||r||2 = ||E||2||¥ll2- 


Remark. A numerical example for (15.1) and (15.3) can be found in §§15.10 and 15.13. 


>Notice the similarity to the proof of Theorem 11.9. 
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Exercise. For componentwise analysis, the backward error is similarly defined as 
We(*) = min{e: (A+ E£)% =b,|E| <elAl}. 


Prove the Oettli-Prager Theorem (1964) (recall the convention in Footnote 30): 


r=b— Ak. (15.2) 


Compare w.(%) with the normwise backward error w(%) (with respect to the co-norm). 


15.3 From x — ¥ = A~!r and the expression (15.1) of the backward error w(#) it 
directly follows a simple forward error estimate, cf. (11.5): 


x= 1h < WAT Mell _ 


zl Wal 


K(A) - w(#). (15.3) 


Remark. Since the calculation of x(A) = ||A7"|| - || Al] would often be much too costly and 
the only thing that matters in an estimate like (15.3) is actually the order of magnitude of the 
given bound, in practice, one often uses much more “low-cost” estimates of K(A); e.g.,°° 


condest(A) % estimate of «1(A) 
condest(A’) % estimate of Kx(A), cf. (C.3) 


Exercise. Show the alternative forward error estimate 
< cond(A, £) - we(£). 


using the Oettli-Prager Theorem and the Skeel—Bauer condition number. Given reasons 
why this estimate is sharper than (15.3) if ¥ has been computed in a componentwise 
backward stable fashion. 


A priori Stability Analysis of a Solution by Matrix Factorization 


15.4 In Chapter II we solved the linear system of equations Ax = b with a suitable 
factorization A = MN (where the right hand side b is supposed to be fixed): 


f: Ars (MN) HS x. 
In doing so, the start section g is exactly the factorization step; specifically, one of 
e triangular decomposition P'A = LU with pivoting: M—=PL,N =U; 
e Cholesky decomposition A = LL’ (A s.p.d.): M=L,N=UL’; 
e QR decomposition A= QR: M=Q,N=R. 


56N. J. Higham: Algorithm 674: FORTRAN codes for estimating the L1-norm of a real or complex matrix, with 
applications to condition number estimation, ACM Trans. Math. Software 14, 381-396, 1988. 
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The end section ft contains the computation of x from the factorization, a process 
which can be realized algorithmically in a backward stable fashion by either forward 
substitution, back substitution or by multiplication with Q’: in the end, such 
triangular or unitary factors F € F”*™ yield computed solutions 7 € F” of the 
system Fu = v € F” which satisfy 


(F+AF)Q=0, — ||AF|| = O(€mach)||FIl- 


The proof can be carried out similarly to §§13.5 and 13.6. 


15.5 It remains to better understand the start section g. Because of 
g ':(M,N)OM-N 


its contribution to the backward error in A is of order (see §§13.8 and 11.7) 


M|| - ||N 
(Mon) 


In fact, all three factorizations (with A s.p.d. if Cholesky decomposition is em- 
ployed) yield a solution £ € FF" of Ax = b that satisfies”” 


(A+E)£=b, — |E|| = O(||MIl || Nl €mach) (15.4) 


for machine data A and b. The factorization based algorithm for the solution of a 
linear system of equations Ax = b is therefore 


e vulnerable to instability in the malignant case ||M|| - ||N|| > || All; 


e provably backward stable in the benign case ||M|| - ||N|| = || All. 
Pp y Ss 


Exercise. State similar criteria for componentwise backward stability. 


15.6 By unitary invariance (C.2),a QR decomposition A = Q- R satisfies 


QllaRll2 = [Rilo = l]QRll2 = || Alle, 


so that the benign case is at hand. With the modified Gram-Schmidt, Givens or 
Householder methods, the linear system of equations Ax = b is therefore provably 
solved in a normwise backward stable fashion. 


Exercise. Construct a 2 x 2 matrix A and a right-hand-side b, such that the numerical 
solution ¥ of the linear system of equations Ax = b by QR decomposition is not backward 
stable with respect to componentwise relative errors. How about the stability of x9 = A\b? 


Hint: Calculate the componentwise backward error w.(%) with the help of (15.2). 


57Cf. §§9-10 and 18 in N. J. Higham: Accuracy and Stability of Numerical Algorithms, 2nd ed., Society of 
Industrial and Applied Mathematics, Philadelphia, 2002. 
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15.7 If Aiss.p.d., there is A’ = A and A;(A) > 0. Hence, its spectral norm satisfies 
2 
| Alls = max A; (AA) = max A; j(A?) = = max Aj j(A) = ( max A; (A)) 
i= j=l:m 


and that of the factors of the Cholesky decomposition A = LL’ satisfies 


Ella = Lz = max Aj(LL') = max Aj(A). 
j= j=lim 


Thereby, we have shown that the benign case is at hand, 
LllallL'l2 = lAll2, 


and the linear system is thus provably solved in a backward stable fashion, too. 


15.8 It is more challenging to handle the topic of triangular decomposition. With- 
out loss of generality we assume that the matrix A € GL(m;K) has the normalized 
triangular decomposition A = LU (in general, of course, we have to use partial 
pivoting: simply replace A with P’A in the following analysis). 

The most useful estimates can be obtained by studying the condition number 
of g-! : (L,U) ++ L-U for componentwise relative perturbations of L and U. 
The corresponding condition number estimate (11.4) suggests immediately the 
following tightening of (15.4): 


Theorem (Wilkinson 1961). The triangular decomposition A = LU of an invertible 
matrix A yields a numerical solution & of Ax = b with the backward error 


(A+E)£=b, — |[Ellco = O (ILI - [UI lo) €mach- 


In the benign case |||L| - |U||]oo  ||Alloo, triangular decomposition is provably 
a backward stable solver, whereas in the malignant case |||L| -|U||]oo >> ||Alleo 
stability is at risk. 


Example. With this criterion, we can develop a deeper understanding for the necessity of 
pivoting by going back to the example of §7.8, which illustrated numerical instability: 


e 1 1 0 € 1 
a=(i 1) = LU, b= (01 Ae u=(5 eee O0<e<l. 


This is a malignant case of triangular decomposition since 
-1 
|| Aloo = 2 < I[|L|- |U||oo = 2. 


However, swapping both of the rows by partial pivoting, 


i ek _4t 0 4% 4 
Pa=(¢ 1) = LU, b= (2 a): la: toe) 


maneuvers the system into a benign case: due to L, U > 0, there is ||P’ Aloo = |||L| - |U| loo. 


64 Error Analysis [ Ch. Il 


15.9 Backward stability of using the triangular decomposition P'A = LR with 
partial pivoting”® therefore depends on the growth factor?? 


— WEL I Rilleo J [Lleol] Rlloo 


yA S 
(A) = Tle [Allee 


€ ||Lloo||L* Joo = too (L). (15.5) 


This factor can be bounded independently of A (for fixed dimension m): 

Lemma. For a unipotent lower triangular matrix L € K"*" with |L| < 1 it holds 
I[Llleo <m, |] L"Ileo < 2", 

and therefore y(A) < m-2"~! for triangular decomposition with partial pivoting. 


Proof. Recall from §5.4 that a unipotent triangular L = (Ajx) jk is invertible. By 
partitioning L~! into its rows zi 


the relation I = LL! yields 


and by using that L is unipotent, expansion of 


jl jl 
Njxz thatis, z= e;— )oAgz, (f= 1:m). 
=1 k 


j 
i eh otal 
ey = yo Az = 2 + 
k=1 =1 


k 


With |Ajx| < 1 and llejll = 1, the triangle inequality implies 


j-1 
lIzjlla <1+ Do lela = G@= 1m). 
k=1 


Majorizing by 2/-! = 1+ ae 2k-1 inductively yields IIZjlla < 2-1. Consequently, 
in accordance with the definition of the max-row-sum norm (§C.8), we get 


|L-* loo = max [zi]. <2". 
j=lsm 


It directly follows from |Ajx| < 1 that ||L||oo < m; thus the proof is complete. 


Triangular decomposition with partial pivoting is therefore provably backward 
stable in the following cases: 


e the dimension is small (let us say 1 < m < 10); or 


e the growth factor (15.5) fulfills y(A) ® 1. 


58See Theorem 7.11. 
>°The max-row-sum norm is invariant under column permutations: ||P’ Alco = || Alco. 


2 >> A 
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15.10 In “the wild”, very large growth factors y(A) have only extremely seldom 
been documented so far; but they can nevertheless be “artificially” constructed. 
All inequalities in the proof of Lemma 15.9 become equalities for the particular 
triangular matrix 


1 
—1 1 

L=|=4 21 “ €k"™™", 
=f. 2 fence A 


so that the upper bounds are taken: 
[|Lllo =m, IL" ||oo = 2" 7. 


This is the L factor of the Wilkinson matrix (due to |L| < 1 there is no pivoting) 


1 1 1 1 
-1 1 1 1 2 
; {29% 
= | =| i an" 


with a growth factor that grows exponentially in the dimension m: 


_ [MIL[-[Ullloo _ m+2"=2 9, 
T= Wl. a 


It is, however, Koo(A) = m. Already for moderate dimensions m, there should be 
signs of severe numerical instability. 


Example. A numerical example for m = 25 exhibits such an instability: 


Pe il = Alyse 
2*eye(m)-tril(ones(m)); AC:,m)=1; % Wilkinson-Matrix 


3 >> rng(847); b = randn(m,1); % reproducible random right-hand-side 


1 >> [L,U,p] = lu(A,’vector’); % triangular decomposition with partial pivoting 
5 >> x = U\CL\b(p)); % substitutions 
6 >> r = b —- Axx; % residual 
7 >> omega = norm(r,inf)/(norm(A,inf)*norm(x,inf)) % backward error (15.1) 
8 omega = 

7.7456e-11 


The backward error (2) is therefore by a factor of about 7(A)/2 larger than €mach: AS a 
comparison we add a backward stable solution with QR decomposition: 


6Please inform me if a relevant practical example has been “brought down”. 
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) >> [Q,R_qr] = qr(A); % QR decomposition 
>> x_qr = R_qr\(Q’*b); % substitutions 
2 >> yr_qr = b - A*x_qr; % residual 
3 >> omega_qr = norm(r_qr,inf)/(norm(A,inf)*norm(x_qr,inf)) % back. error 
omega_qr = 
4.9507e-17 


Since A with Koo(A) = 25 is well-conditioned, the discrepancy between both results also 
very clearly substantiates the instability of using triangular decomposition in this case: 


5 >> norm(x-x_qr,inf)/norm(x,inf) % relative discrepancy to QR result 
q pancy 


ans = 
1.0972e-09 


This limited accuracy fits well with the forward error estimate (15.3): Keo(A)w(%) & 2.107%. 
Here, the theoretical predictions only overestimate both, forward and backward error, by a 
factor of 2. 


Iterative Refinement 


15.11 Arguing strictly formally, we could try to correct a numerical solution % of 
Ax = b by applying the following linear system of equations for the error w: 


r=b—Af, Aw=r, x=K+w. 


Since machine arithmetic itself only provides a somewhat perturbed result 7, one 
is invited to iterate over this correction step: with xo = %, for k = 0,1,2,... 


th = b— Ax, AWe=Tk, Xkp1 = Xe + Wp. 


In this process, the matrix factorization of A only needs to be calculated once. 
Instead of asking about the convergence of this iterative refinement, one should ask 
whether the stopping criterion of backward stability will have been reached within 
a number of, say, n steps: w(%1) = O(€mach): 


15.12 This stopping criterion should be matched for a well-conditioned matrix A 
(for which forward and backward errors of the numerical solution are of the same 
magnitude), if the leading n-th part of the solution mantissa is always computed 
correctly using a fixed matrix factorization of A. Since the significant digits of 
w, start where the significant digits of x, end, a correct n-th part of the solution 
mantissa is added in every step of the iterative refinement: the first correction 
corrects the second n-th part of the mantissa, the second one corrects the third 
n-th part, etc.. This way, after n — 1 correction steps, approximately all the digits 
of the solution x have been computed correctly. 


19 
2 >> x = x + Ww; r = b — Axx; % first step of it. refinement and new residual 


21 


24 
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15.13 Assuming that «..(A) = O(1) and y(A) = O(e; 32), these preliminary 
considerations suggest that triangular decomposition is good for at least half of 
the mantissa and therefore just a single step of iterative refinement should be 
adequate for computing an accurate result.®! As a matter of fact, one can prove 
the following remarkable theorem:°? 

Theorem (Skeel 1980). For 7(A)?Xo0(A)emach = O(1), triangular de- 
composition with partial pivoting yields a backward stable solution after one 


single step of iterative refinement. 


Example. We will illustrate this theorem with the numerical example from §15.10: here, the 
prerequisite is fulfilled by means of 7(A)*oo(A)émach © 1 for m = 25. 


>> w = U\C(L\r(p)); % correction from the stored triangular decomposition 


>> omega = norm(r,inf)/(norm(A,inf)*norm(x,inf)) % backward stability! 


2 omega = 


6.1883e-18 
>> norm(x-x_qr,inf)/norm(x,inf) % relative deviation to QR result 
ans = 

9.2825e-16 


As a comparison: the unavoidable error in this case is about Koo(A) - €mach © 2° 10-}, 


Exercise. Test the limitations of iterative refinement for larger dimensions of m. 


61 This way, one still saves a factor of 2 in computational cost when compared to QR decomposition, 
cf. §§7.7 and 9.8. 

Cf, §E.1; for a proof with componentwise relative errors see R. D. Skeel: Iterative refinement implies 
numerical stability for Gaussian elimination, Math. Comp. 35, 817-832, 1980. 


IV Least Squares 


The choice of the square was purely arbitrary. 


(Carl Friedrich Gauss 1839) 


The method of least squares is the automobile of 
modern statistical analysis: despite its limitations, 
occasional accidents, and incidental pollution, it is 
known and valued by nearly all. 


(Stephen Stigler 1981) 


16 Normal Equation 
16.1 In experimental and observational sciences, one is often confronted with the 
task of estimating parameters p = (61,...,0,)/ of a mathematical model from a 
set of noisy measurements.® If the parameters enter the model linearly, such a 
set-up can be written in the form 
b=Ap-+e 

where 

e b € R” is the measured observation vector; 

e A€ R”*" is the design matrix; 

e p © R" is the parameter vector or feature vector to be estimated; 


© e = (€1,...,€m)’ is the vector of inaccessible, random perturbations (noise). 


The number of measurements m is often larger than the number of parameters n. 


®3This is referred to as data fitting or regression analysis in parametric statistics. 
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16.2 An estimator x of the parameter vector p replaces the actual perturbation e 
with the residual r where 


b=Ax+tr, r = (01,---,Pm)- 
The least squares estimator then solves the minimization problem 
2_ 2 
IIrll2 = dj = mini; 
j= 


the Gauss—Markov theorem asserts its optimality if the noise obeys certain statisti- 
cal properties: centered, uncorrelated, equal variance. For unequal variances or 
correlated noise one uses the weighted or generalized least squares method. 


Remark. The least squares method was first published in 1805 by Adrien-Marie Legendre in his 
work on trajectories of comets; it had however already been used in 1801 by Carl Friedrich 
Gauss for calculating the trajectory of the minor planet Ceres. Both mathematicians, having 
previously got in each other’s way while studying number theory and elliptic functions, 
bitterly fought for decades over their priority in this discovery. 


Exercise. Show that the least squares estimator of the parameter @ from the measurements 


is equal to the sample mean es Ee ;. Solve for it in two ways: directly and with (16.1). 


16.3 We therefore consider the least squares problem 


x = argmin||b — Ay|l>, AER™", bER"”. 
yelR" 


Equivalently, we have to minimize the function 
F(y) = 5 |b — Ay||5 = 5 (b'b — 2y'A'b + y/A'Ay) 


whose gradient is VF(y) = —A’'b + A’ Ay. The necessary optimality condition 
VF(x) = 0 is therefore equivalent to the normal equation (as introduced by Gauss) 


A’ Ax = A'b. (16.1) 


For A with full column rank (which we want to assume from now on), and thus for 
A'A s.p.d. (Lemma 9.1), the normal equation possesses a unique solution x € IR". 
This actually yields the unique minimum of F: 


F(y) = 2(b'b — 2y A’Ax + yA! Ay) = F(x) + Ay —x)/A'Aly — x) > F(2) 


with equality for precisely y = x (A’A is s.p.d.). Hence, we have proven for K = R: 


64That is to say that the it is the best linear unbiased estimator (BLUE). 
65R. L. Plackett: Studies in the History of Probability and Statistics. XXIX: The discovery of the method of least 
squares, Biometrika 59, 239-251, 1972. 
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Theorem. For A € K"*" with full column rank, the least squares problem 


x = arg min ||b — Ay]|l2, A€K™*", bE K", 
ycK" 


is equivalent to the uniquely solvable normal equation A'Ax = A‘b, where A'A is s.p.d.. 


Exercise. Show the theorem to hold for K = C, too. Hint: Separate real and imaginary parts. 


Remark. The solution map At : b++ x = (A’A)~!A’b is linear in b; the induced matrix is 
called the pseudoinverse of A and coincides with the inverse for m = n. 
16.4 The normal equation leads to the solution algorithm 


ee assembly of the » AAt solution of the 


+x (16.2) 
normal equation normal equation 

of the least squares problem (the vector b being fixed); the normal equation 

itself is of course solved with the Cholesky decomposition from §8.3.° Using the 

symmetry of A’A, the computational cost is therefore (in leading order) 


# flop(matrix product A’A) + #flop(Cholesky decomposition of A’A) 


1 
= 2 mer) 
mn* + ae 
16.5 To evaluate the stability of (16.2) using Criterion F from §13.8,° we compare 
the condition number «2(A’A) of the normal equation with that of the least 
squares problem. Since it can be shown for m = n, as was the case in §15.7, that 


K2(A) = \/K2(A'A), (16.3) 


we define (A) as (16.3) for the case of m > n. The relative condition number x; 
of the least squares problem with respect to perturbations in A is bounded as 


max (K2(A),0 Ko(A)*) < Kis < (A) + w+ (AP? (16.4) 


with the relative measure of the residual (cf. (15.1)) 


IIrll2 
y 
I|All2 Ill 


Cholesky had originally developed his method for the system of normal equations in the field of 
mathematical geodesy. 

6’ The start section is in this case not invertible, so Criterion B is not applicable. 

68In contrast, it holds for perturbations in b 


wk(A) < Krg = k2(A)|b]]2/(|Allzl|xll2) < (1 + @)x2(A). 


See P.-A. Wedin: Perturbation theory for pseudo-inverses, BIT 13, 217-232, 1973; A. van der Sluis: 
Stability of the solutions of linear least squares problems, Numer. Math. 23, 241-254, 1975. 


r=b—Ax. 


w= 
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Therefore, (16.2) is likely unstable if the residual r is relatively small (w < 1) and 
the matrix A is ill-conditioned (x2(A) >> 1): it then holds 


Kis < (K2(A)~ | +.w)x2(A)* < K2(A)* = k2(A’A). 


In contrast, for all other cases, i.e., for relatively large residuals or well-conditioned 
matrices, the use of the normal equation is stable. 


Example. The numerical instability of the algorithm (16.2) can famously be illustrated with 
the following vanishing-residual example (P. Lauchli 1961): 


1 1 2 
= _ _f\\. a, (lite 1 ry, 2 
a=(ea).o=(e).2=(): wa=(4% 2a) mutn=3en 


€ 


Here, A’A is actually exactly numerically singular for e7 < 2€mach since then 1 + «7 = 1 
(the information about € is completely lost while assembling A'A; after the start section, 
the parameter € can no longer be reconstructed, cf. Criterion B from §13.8). For a slightly 
larger € we obtain the following numerical example: 


>> e = 1e-7; Re=aome 
>> A = [1 1;e 0;0 e]; b = [2;e;e]; % Lauchli example 


5 >> x = (A’*A)\(A?*bd) % solution of the normal equation 


x = 
1.011235955056180e+00 
9.887640449438204e-01 


The loss of 14 decimal places is consistent with x2(A'A) ~ 2-10!4; however, due to the 
condition number xis = k2(A) © 107, at most a loss of approximately 7 decimal places 
would be acceptable. 


17 Orthogonalization 


For the following, we will assume that A € K"*" has full column rank and, therefore, m > n. 


17.1 A stable algorithm for the least squares problem should access A directly®? 
and not via the detour of A’A. With the help of a (normalized) QR decomposition 
A = QR, we can simplify the normal equations according to Theorem 9.3 to 


A'Ax= A’ b 
QQ “WY 


=R/R =R’/Q! 
and therefore equivalently reduce via multiplication with (R’)~! to 


Rx = Q'b. (17.1) 


Tn 1961, Eduard Stiefel spoke of the “principle of direct attack” in the field of numerical analysis. 
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The backward stability of this algorithm—with the modified Gram—Schmidt, Givens 
or Householder method for the calculation of the reduced QR decomposition—can 
be understood as in §15.6 by the section decomposition 


A reduced ( ) multiplication with Q’ : 
SS. t x 
QR factorization : backward substitution 


Here, due to the column orthonormality of Q, the inverse start section, that is 
(Q,R)++ Q:R=A, is actually well-conditioned: || Q||2||R|/2 = ||All2- 

Exercise. Show for the full QR decomposition (9.3): r = b — Ax = (I— Q1Q))b = QoQhb. 
Explain why only the third formula is stable and should therefore be used for the residual. 
17.2 By means of a simple trick, we can actually by and large forgo explicit knowl- 


edge of the factor Q. To do so, we calculate the normalized QR decomposition of 
the matrix A augmented by the column J, i.e., 


Cl 


By expanding we obtain not only the QR decomposition A = QR, but also 


R|z 


b= Qz+ pq. 
Due to Q'Q = I and Q’q = 0, multiplication with Q’ provides the relationship 
Q'b =z. 
Thus, the constitutive equation (17.1) can be simplified to 
Rx =z. 
Further, qq = ||q||5 = 1 implies that the positive p is the norm of the residual: 
p = ipalla = Ilo — Qzlla = |b — QRxllp = I]b — Axl. 
To summarize, this algorithm can be written in MATLAB as follows: 


Program 12 (Q-Free Solution of the Least Squares Problem). 


Notice: in MATLAB the QR decomposition is not normalized to a positive diagonal of R. 


R = triu(qr([A b])); % R-factor of the matrix A extended by b 
2 GR Gene dem NR Gemen tdi ee, solutwonmod nit, 
3 rho = abs(R(nt+i,n+1)); % norm of the residual 


As for linear systems of equations, the least squares problem is briefly solved by the line 


x = A\b; 
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Example. Let us re-examine the numerical example from §16.5: 


>> x = A\b 


9° 999999999999996'e—01' 
1.000000000000000e+00 


Despite the relatively large condition number xj5 = K2(A) © 107, almost all digits are 
correct; this time the machine arithmetic does not “trigger” the worst case scenario. 


Exercise. Let the columns of a matrix B € R™*" constitute a basis of the n-dimensional 
subspace U C R". For x € IR", denote by d(x, U) the distance from x to U. 

e Render the calculation of d(x, U) as a least squares problem. 

e Write a two line MATLAB code that calculates d(x, U) given the input (B, x). 


17.3 The computational cost of such a Q-free solution of a least squares problem 
using the Householder method is, according to §§9.8 and D.5, 


2mn?2 — 2n3 /3 flop. 


For m > n, the computational cost is thus two times larger than the cost to 
calculate the solution via the normal equation with Cholesky decomposition, 
which in accordance with §16.4 is 


mn? +n? /3 flop. 


For m © n, both are approximately 4m? /3 flop and therefore of comparable size. 
Therefore, both algorithms have their merit: the normal equation with Cholesky 
should be used for m >> n with a relatively large residual (in the sense of w & 1) 
or a comparatively well-conditioned matrix A; orthogonalization should be used 
in all other cases (cf. §16.5). 


Exercise. Let a matrix A € R™*" be given with full row rank m < n. Examine the under- 
determined linear system of equations Ax = b. 


e Show that there exists a unique minimal solution x. for which it holds 
Xx, = argmin{||x||2: Ax =D}. 


This solution is given by x» = A'w, where w solves AA’w = b. 
e Explain why this two-step algorithm is potentially unstable. 

Hint: Use, without proof, the fact that”? x(x.; A) < 22(A) with x2(A) = \/x2(AA’). 
e Develop an efficient and stable algorithm to calculate x,. 


e Supplement this algorithm with the calculation of ker(A). 


See Theorem 5.6.1 in G. H. Golub, C. F. Van Loan: Matrix Computations, 4th ed., The Johns Hopkins 
University Press, Baltimore, 2013. 


V Eigenvalue Problems 


When | was a graduate student working at Oak Ridge, my office 
was dubbed the Eigencave. 


(Pete Stewart 1998) 


While in theoretical mathematics a problem can often be 
simplified by means of transformations, the indiscriminate 
application of such methods in numerical mathematics often leads 
to failure. This is due to the fact that the transformed problem 
contains less numerical information than the original problem. 


(Eduard Stiefel 1961) 


18 Basic Concepts 


18.1 An eigenpair (A,x) € C x C™ of a matrix A € C”*"™, composed of an eigen- 
value A and an eigenvector x, is defined by 


Ax = Ax, x #0; (18.1) 


the set of eigenvalues is the spectrum o(A) of A. Since the eigenvalue equation is 
homogeneous in x, we often consider eigenvectors normalized by ||x||2 = 1. 


18.2 Obviously, A is an eigenvalue, if and only if AI — A is singular and therefore 
A is a root of the characteristic polynomial 


x(G) = det(¢I — A); 
the multiplicity of this root is the (algebraic) multiplicity of the eigenvalue. 


Remark. The fundamental theorem of algebra ensures the existence of eigenvalues A, 
corresponding eigenvectors are obtained, at least theoretically, as a kernel vector of AI — A. 


18.3 The connection to the characteristic polynomial x suggests an algorithm for 
calculating the eigenvalues by first finding the coefficients and then the roots of x: 


A ig aaa: 


Unfortunately, this is numerically unstable due to the ill-conditioned end section h. 
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Example. The application to a diagonal matrix with the eigenvalues 1,...,22 yields: 


>> A = diag(1:22); y/ diagonal matrix with eigenvalues 1:22 
>> chi = charpoly(A); / coefficents of the charachteristic polynomial 
3 >> lambda = roots(chi); % roots 


4 


>> lambda (7:8) 


5 ans 


15.4373281959839 + 1.07363608241233i 
15.4373281959839 - 1.07363608241233i 


A comparison with the nearest eigenvalue A = 15 yields an absolute error of © 1. We will 
see in §19.4 that only an absolute error of O(€mach) would be acceptable because of the 
well-conditioned nature of the eigenvalue problem itself. 


Exercise. Find the condition number of a simple root of the polynomial x as a function of 
its coefficients; use componentwise relative error for the coefficients and absolute error for 
the roots. What is the condition number for A = 15 in the example above? 


Answer: «(A) = x2(\A|)/|x! (A) |, where the polynomial x! is obtained from x by applying 


absolute values to the coefficients. In the example, «(15) © 6- 101%, so that «(15)» mach © 6: 


8 >> lambda = 15; 
9 >> kappa = polyval(abs(chi) ,lambda)/abs(polyval (polyder (chi) , lambda) ) 
i0 kappa = 


1 


5.7345¢e+16 


Remark. In numerical analysis, one goes the other way round: the roots of a polynomial 


CT $m 1G" +++ tal +49 


are namely the eigenvalues of the corresponding companion matrix 


0 —Xo 

1 ky 
0 —Xm—2 
1 —Xm—-1 


18.4 If the eigenpair (A, x) is already known, it can be “split off” from A. To do 
so, we extend the normalized eigenvector x to an orthonormal basis, that is, let 


o=(x|u) 


be unitary.”! It then actually holds 


x! A | x’/AU 
Q4Q=|—, (ax | Au) = re 


’ A, = U'AU, 


71Such a Q can be constructed by a complete QR decomposition as in Theorem 9.10: [Q,~] = qr(x). 


Sec. 18] Basic Concepts 77 


so that because of (an implicit polynomial division) 
det(@I — A) = (€ — A) det(ZI — A) 
the eigenvalues of the deflated matrix A, supply the remaining eigenvalues of A. 


This reduction of dimension is called deflation of A. 


Exercise. Given an eigenpair (1, y) of Ay, construct a corresponding eigenpair (1,z) of A. 


18.5 By applying the deflation technique once more to the deflated matrix A) 
itself and by repeating this process recursively, we obtain, row-wise from top to bot- 
tom, a Schur decomposition or unitary triangulation (for a column-wise construction 
see §21.8) 


i 


Q"AQ=T= : a ; o(A) = {A1,...,Am}, (18.2) 
* 
Am 
with a unitary matrix Q and an upper triangular matrix T, where the eigenvalues 
of A appear on the diagonal of T according to their multiplicity. 


Remark. Numerically, a Schur decomposition is always, without exception, to be preferred to 
the Jordan normal form (why?). For most applications, all of the information of interest in 
the Jordan normal form can essentially already be found in a Schur decomposition. 


18.6 If A is normal, i.e., A'A = AA’, it follows from QQ’ = I that also T is normal: 
T'T = O'A'Q0'AQ = O'A'AQ = O'AA'O = O'AQQ'A'Q = TT". 
Normal triangular matrices are diagonal: the proof goes inductively, by writing 
tlt 


T= 
* 


and expanding T’T = TT’, one gets as the upper left entry 
||? = |t|7 + |l¢||3, thus t=0. 


Thus, a normal (and hence, a fortiori, self-adjoint) matrix A can be unitarily diago- 
nalized: there is a unitary matrix Q, such that 


Q'AQ =D =diag(Ay,...,Am),  0(A) = {A1,---, Am}- (18.3) 


Inversely, unitary diagonalizable matrices are, of course, always normal. 

Remark. Since AQ = QD, every column vector of Q is an eigenvector of A; hence, every 
normal matrix possesses an orthonormal basis of eigenvectors. 

Exercise. Show that for a self adjoint matrix A, the eigenvalues are real. For a real self-adjoint 
matrix A, the matrix Q can also be constructed as real. 
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19 Perturbation Theory 


Backward Error 


19.1 Let the matrix A be given along with a perturbed eigenpair (A,x), x # 0. 
As with linear systems of equations, we define backward stability as the smallest 
perturbation of A for which this pair becomes an exact eigenpair: 


This quantity can be directly calculated with the help of Theorem 15.2 of Rigal 
and Gaches by setting b = Ax: 


IIrlle 
y 
I|Allallxll2 


Thus, a calculation of a numerical eigenpair with w = O(€mach) is backward stable. 


w= r= Ax—Ax. 


19.2 Let an approximate eigenvector x 4 0 be given. Which A € C is the best 
approximate eigenvalue fitting to it in terms of the backward error w? Since the 
denominator of the expression w does not depend on the approximate eigenvalue, 
it suffices to minimize the residual itself: 
A = argmin || Ax — x- pul|2. 
pec 

The normal equation (cf. Theorem 16.3) of this least squares problem provides (no- 
tice here that x plays the role of the design matrix and Ax that of the observation) 
x! Ax 


, 


x'x-A =x'Ax, suchthat A= 


x!x 
this expression of the optimal A is called the Rayleigh quotient of A in x. 
19.3 Now, let an approximate eigenvalue A be given. Its absolute backward error 
y = min {||El]2: A € o(A+E)} = min {||El|2 : AI — (A+E) is singular} 
is exactly 7 = ||AI — Al|2/x2(AI — A) by Kahan’s Theorem 11.9, that is, we have 
|Al—A) Iz", Ago(A), 
n = sep(A, A) = (19.1) 
0, otherwise; 
where sep(A, A) is called the separation between A and A. According to §11.2, the 


absolute condition number xk p,(A) of an eigenvalue is therefore given by 


A-A 
Kabs(A) = limsu | . | , 
Asa’ sep(A, A) 


o(A). 


Exercise. Show that sep(A, A) < dist(A,a(A)) and therefore invariably xp5(A) > 1. 
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Conditioning of the Eigenvalue Problem with Normal Matrices 
19.4 In the case of normal matrices, the separation is the distance to the spectrum: 
Theorem. If the matrix A is normal, then it holds sep(A, A) = dist(A,o(A)). 
Proof. Unitary diagonalization (18.3) implies that 
(AI— A)“ = Q(aT-D)“1Q' 


with a diagonal matrix D composed of the eigenvalues of A and therefore 


1 1 
AI— A)7||2 = || AI— D)7 |lo = = ; 
II( ) Ilo I ¢ ) Ilo nena [A _ u| dist(A,0(A)) 


diagonal matrix 


Here, we have used the unitary invariance of the spectral norm and (C.4). 


Exercise. Construct an A with 0 € o(A) such that dist(A,a(A))/sep(A, A) > 00 for A > 0. 


Together with (19.1) this theorem directly shows that the eigenvalues of normal 
matrices are always well-conditioned in the sense of absolute error: 


Corollary (Bauer—-Fike 1960). Let A be normal. It then holds that for A € o(A +E) 
dist(A,a(A)) < ||E|l2. 

The absolute condition number of an eigenvalue A € o(A) satisfies Kgps(A) = 1. 

Exercise. Show for diagonalizable A = XDX~! that dist(A,o(A)) < k2(X)||Ell2. 

19.5 For simple eigenvalues, the condition number of the corresponding eigenvec- 


tor can be estimated at least. Since only the direction of eigenvectors matters, we 
measure their perturbation in terms of angles: 


Theorem (Davis—Kahan 1970). Let (A,x) be an eigenpair of a normal matrix A, where 
the eigenvalue A is simple; furthermore let (A,X) be an eigenpair of A + E. It then holds 


NE, 
dist(A, o(A) \ {A}) 


| sin £(x,%)| < 


For this error measure, the condition number is bounded by x < dist(A,a(A) \ {A})~1. 
This distance is called the spectral gap of A in A. 


Proof. Without loss of generality x and % are normalized. The construction from 
§18.4 leads to 


1= \l£I3 = |Q'E3 = |x'x|* + \|U'z||z, ie, [sin Z(x,%)| = ||U'R||2. (19.2) 
ey 
=cos? £(x,£) 
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From U'A = A,U! (why?) and Ex = A% — AX follows 
U'EX = Au’ — U' A = (AI— A,)U'%, thus U’% = (AI— A,)~!U'ER, 
so that by ||U’||2 = 1 and ||%||2 = 1 we obtain the bound (for general A up to here) 
sin £(x,£) < ||(A1—Aa)“MllalElla = IIEll2/ sep(A, Ay) . 


Along with A, the matrix A, is also normal (cf. §18.6) and since A is simple it 
holds a(A,) = o(A) \ {A}, so that Theorem 19.4 finally yields the assertion. 


Remark. Determining eigenvectors belonging to degenerate eigenvalues is generally an ill- 
posed problem. For instance, every direction in the plane constitutes an eigenvector of the 
identity matrix I € C2*2 with double eigenvalue 1, but only the directions of the coordinate 
axes correspond to the eigenvectors of the perturbed matrix I + E = diag(1+ e,1) for an 
arbitrarily small e 4 0. 


Exercise. For x #0 let P = xx'/(x'x) be the orthogonal projection onto span{x}. Show: 


| sin £(x, £)| = ||P(x) — P(%)Il2- 


19.6 For non-normal matrices A, the study of e-pseudospectra (with small e > 0) 


oe(A) = {A € C: sep(A, A) <e} 9 {ACCA € o(AFE) fora |lEll2 <e} 


is often much more meaningful than just the examination of 7(A). This is a very 
attractive area of research with many applications from random matrices to card 
shuffling to the control of flutter instability for commercial airplanes.” 


20 Power Iteration 


20.1 Since according to §§18.2-18.3 the eigenvalue problem is formally equivalent 
to calculating the roots of polynomials, the Abel—Ruffini theorem tells us that 
for dimensions m > 5 there exists no “solution formula” that would just use the 
operations +, —, X, Ve yf . Instead, one constructs iterative sequences 


(Ho, v0), (141,01), (p2, v2), tee 


of approximate eigenpairs and studies their convergence. 


72See the monograph by L. N. Trefethen, M. Embree: Spectra and Pseudospectra. The Behavior of Nonnormal 
Matrices and Operators, Princeton University Press, Princeton and Oxford, 2005. 
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20.2 In practice, such iterations are stopped, at the latest, when backward stability 
is reached at the nth iterate (cf. §19.1):” 


rll 

Wn = O(€mach), Wn = TAlblonla’ tn = AVn — UnVn. 
Since the spectral norm of a matrix would have to be approximated as an eigen- 
value problem itself (§C.8), which is costly and would be a circular reasoning here, 
we replace it with the Frobenius norm and use the following bound instead (§C.9) 


[ral 


O, = — On < Wy < JM Gn. (20.1) 
" Allellenll2 ox 7 


Concretely, the approximation is stopped at a user defined @, < tol or one iterates 
until @, saturates at the level O(€mach) of numerical noise (cf. Fig. 1). 


a + 6=0.92 
107, + 6=026 | 


10-81: a 


backward error estimate 


100 200 300 400 500 


iteration index k 


Figure 1: Convergence of the power iteration for two different normal 1000 x 1000 matrices. 


20.3 By considering (18.1) a fixed point equation for the direction of the eigenvec 
tor x, the fixed point iteration k = 1,2,... starting at a vector vp is: 


We = AVE} (application of the matrix) 

Ve = We/ || Well2 (normalization) 

Ue = 0, Ad, (Rayleigh quotient as in §19.2) 
we 
=U 


In the literature, this is called power iteration, because v, is the vector obtained 
from normalizing Avg (R. von Mises 1929). 


This non-asymptotic use of the O-notation is “par abus de langage” (N. Bourbaki); in this case it 
ymp P Bag 


means Ww; < C€mach With a suitably chosen constant c, which we allow to be polynomially dependent 
on the dimension m (say, to be concrete, c = 10m). 
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Program |3 (Power Iteration). The program is organized in such a way that interme- 
diate results are never re-calculated but stored instead. 


normA = norm(A,’fro’); % store ||Allr 
omega = inf; % initialize the backward error 
3 w = Atv; ¥, initialize w= Av 

while omega > tol % is the backward error still too large? 
v = w/norm(w); % new v 
w = Ax*v; i new w 
mu = v’*w; % Rayleigh quotient 
r=w - mutv; % residual 
omega = norm(r)/normA; % backward error estimate W 

end 


20.4 Power iteration generally converges to an eigenvector corresponding to the 
eigenvalue of largest size as long as it is unique, then called the dominant eigenvalue. 
For the sake of simplicity, we limit ourselves to normal matrices: 


Theorem. Let it be possible to order the eigenvalues of a normal matrix A as 


Ar 
|Aq| > |Az| >-:+ > |Am|, so that especially @ = |<? 


1 
MG 


Let x be a normalized eigenvector corresponding to the dominant eigenvalue Aj. If the 
starting vector v9 satisfies the condition’* xv 4 0, power iteration calculates approxi- 
mate eigenpairs (Uz, VK) converging as’? 


sin (v%,x1) = O(0*), [ue —Ax| = O(0**) (kK 4 &). 
Proof. We will use the unitary diagonalization of A from §18.6 in the form 
Mi 


Q'AQ = Sly Q=(n|U), D = U'AU = diag(Ay,...,Am). 


Using the transformed vector y = U’vg it follows from the diagonal structure that 
Vo = 1X1 + Uy, 91 = X00 #0, Akog = my Ax, + UD‘y. 


The diagonal matrix D satisfies ||D* ||) = |A2|k = |A,|*e* (cf. §C.11). Therefore, 
from 11 At # 0 we obtain 


Akog = mAk(x1 + O(0")) ~ mAkx, (kK 3 0). 


Thus Avo becomes asymptotically parallel to the eigenvector x; at a rate of O(6*). 


”4This condition is best realized with a random choice of vg. 
75We do not claim that the vg converge itself: in general, v, = o,x1 + O(6") with phases (phase = 
“complex sign”) |o%| = 1 which do not necessarily converge—“the signs may jump.” 
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The precise convergence estimates can be accounted for as follows: since 7; and A‘vg 
only differ by a normalization, (19.2) provides 


_ [Ul AFeollo \|D*yllo 
|A¥zoll2 [lm Ax, + UDKy||2 


| sin £ (og, x1)| = ||U’o|2 


|| D¥y|| 6 llyllo/|m| k 
Iya ||AalF — ]D¥y|l2 ~ 1— OF ily|l2/ lik 


where we have used 1;Ak 4 0 and, eventually, 0*||y||2/|71| < 1. Also due to (19.2), the 
approximate eigenvalue pi; satisfies 


Uy = VAD, = Aq oxy |? + HLUDU! vp = Aq — Aq||U!O4||5 + HL UDU' a, 
Sm’ 
=1—|U'2% 3 
so that finally from ||D||z = |A2| the estimate 


[pe — Aal < ([Aa] + [Aal) UW I5 = (lal + [Aal) sin? £(og, 21) = O(6"*) 


appears and everything is proven. 
Exercise. Show that the convergence rate of the backward error of (1p, 0%) is wy, = O(0*). 
Explain the convergence plot in Fig. 1 both qualitatively and quantitatively. 

Hint: Backward stability is reached at about 15/|log,, 6| * 50/| log, 6| iteration steps. 


Remark. According to Theorem 19.5, an ill-conditioned eigenvector x; implies A, ~ Az, so 
that then, due to 6 = 1, power iteration converges only very slowly. 
Inverse Iteration 
20.5 We notice the basic, but extremely useful equivalence 
(A,x) eigenpairof A = ((A—yp)~1!,x) eigenpair of (A—pI)~}. 


With the help of a shift 1, that is closer to a simple eigenvalue A than to the rest of 
the spectrum of A, the transformed eigenvalue (A — y)~! can be made dominant 
for the transformed matrix S = (A — yI)~!. By Theorem 20.4, power iteration 
applied to the matrix S now converges towards the eigenvector x. In this way, 
simple eigenpairs can be selectively computed. 


20.6 This yields the inverse iteration with shift  (H. Wielandt 1944), fork = 1,2,...: 


(A = pl)w, = 04-4 (linear system of equations) 
Vp = Wk/||Wzll2 (normalization) 
Up = V, AD, (Rayleigh quotient from §19.2) 


Since the matrix of this linear system of equations is not changed during the 
iteration, just a single matrix factorization needs to be calculated (which is an 
enormous advantage for large dimensions, even for a limited number of iterations). 
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20.7 To avoid the explicit matrix multiplication Av, during the evaluation of 
the Rayleigh quotients py, = 0, AD, and residuals rg = Av, — Mz0,, the following 
auxiliary quantities are introduced: 


OkK-1 


[le Avg — HOR = Te + (Mk =H) Pk, Pk = OZ = OAV — = Me - Hl. 


This way we obtain wy, = p+ pg and ry = Z_ — pdx, thereby cutting the cost of 
each iteration step in half (2m? instead of 4m? flop). 


Program 14 (Inverse Iteration with Shift). 


normA = norm(A,’fro’); % store ||Allr 
2 omega = inf; % initialize the backward error 
3 [L,R,p] = lu(A - muShift*eye(m),’vector’); % store the triangular decomp. 
while omega > tol % is the backward error still too large? 
w = R\(L\v(p)); % solve system of linear equations 
normW = norm(w); % ||wll2 
z = v/normW; % auxiliary quantity z 
v = w/normW; i new v 
rho = v’*z; % auxiliary quantity 0 
mu = muShift + rho; % Rayleigh quotient 
r= z) — rhorv. y%, residual 
omega = norm(r)/normA; % backward error estimate W 
3 end 


20.8 Theorem 20.4 now readily implies a convergence result for inverse iteration: 


Theorem. Let it be possible to order the eigenvalues of a normal matrix A with respect 
to a given shift w € C as 

Ay = Hv 
A2 —H 


|Ay — p| < |A2—p] <--> < Am — pl, so that especially 0 = | | <1. 


Let the vector x, be a normalized eigenvector corresponding to the eigenvalue A4. If 
the starting vector vo satisfies the condition x',v9 # 0, inverse iteration with shift pu 
calculates approximate eigenpairs ([A,, Vg) converging as 


sin £(04,21) = O(0*), Ite —Ar] = O(@*) (K-40). 


A Misguided Objection: IIl-Conditioning of the Linear System to be Solved 


We assume that A € C™*™ is normal. 


20.9 If the shift » is very close to the spectrum of A, then the linear system of 
equations for the inverse iteration, that is, 


(A-pIw=2, — |lel2=1, 
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is ill-conditioned: indeed in keeping with §§19.3-19.4 we get 


= max A- 
JA —plll2 veo(A) |A— HI Sa 


Kea( A = pel) = dist(w,0(A)) — minjeg(a) |A — HI . 


On the computer, we compute a @ that only approximates w very inaccurately. 
Does this point towards a possible numerical instability of the inverse iteration? 
20.10 The answer is a resounding “no!”, because we are not at all interested in 
the vector w itself, but rather only in its direction (i.e., in the normalized vector 
w/ || w||2).76 A backward stable solution @ of the linear system actually satisfies 


(A+E)—pI@=0, — |Ell2 = O(All2 - €mach)- (20.2) 


Were the perturbation E to the be same for all iteration steps (cf. the model for 
iterative refinement in §E.1), we would directly get convergence to an eigenvector 
of A+ E and thus achieve backward stability. With just a little more work, this 
analysis can be carried over to the case where E depends on the iteration index. 


20.11 At the extreme, inverse iteration can be used to compute an eigenvector 
corresponding to a backward stable eigenvalue p taken as shift. Then §19.4 implies 


dist(p,7(A)) = O(\|All2 ‘ Smack) 


so that the iteration matrix A — pl is even numerically singular. Combined with p, 
an approximate eigenvector @ calculated in course of inverse iteration forms an 
approximate eigenpair (y, 7). Using §19.1, its backward error can be bounded by 


Ad — p||2 _ |lo— E@|l2 — 1+ [Ellallella 1 


w= — = - < = = = t Ole h) 
|Alleil@ll2 ~~ Allall@ll2 ~  [Allall@ll2 (Alle il@l, © °° 


If 1/(||All2||@||2) = O(€mach), the eigenpair is thus already backward stable without 
the exact length of ||w]||2 having played any particular role. In principle, if ||v||2 = 1, 
this magnitude is within reach as suggested by the following lower bound: 


1 s 1 _ dist(u,0(A)) 
|Allellell2 ~ |[All2|(A— #1)~*Mlalloll2 [Alle 


Actually, for a randomly chosen normalized vector v, there is a high probability of 
equal order of magnitude for both sides of this bound.”” We are thus led to claim: 


— Ol(Emach) : 


76“The failure to grasp this fact has resulted in a lot of misguided papers.” (Pete Stewart 1998) 
77More specifically, random vectors v generated by “v = randn(m,1); v = v/norm(v) ;” satisfy (for 
a proof see §E.15), with probability > 1 — 6, the bound 


1 


[Abloh <O(STemach)  (0<6 <1). 
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Using a backward stable approximate eigenvalue as shift, just a single step 
of inverse iteration applied to a random starting vector computes, with high 
probability, a backward stable approximate eigenpair. 


Exercise. How can one avoid arithmetic exceptions entering during inverse iteration if the 
iteration matrix A — pI is exactly singular in machine arithmetic? 
Example. Here, despite the warning 

Matrix is close to singular or badly scaled. Results may be inaccurate. 


issued by MATLAB at line 6 of the following code, a backward stable eigenvector to a given 
eigenvalue has been found after just one single step of inverse iteration: 


1 >> m = 1000; rng(847); % initialization 
2 > lambda = (iim) + (-S:m-6)4i; % given eigenvalues 
3 >> [Q,~] = qr(randn(m)); A = Q*diag(lambda)*Q’; % suitable normal matrix 
4 >> mu = 1 —- 5i; % shift = known eigenvalue 
5 >> v = randn(m,1); v = v/norm(v); % vandom start vector 
5 >> w = (A - muxeye(m))\v; % 1 step inverse iteration 
7 >> omega = sqrt (m)*norm(A*w-mu*w)/norm(A,’fro’)/norm(w) % Eq. (20.1) 
8 omega = 
1.3136e-15 


21 QR Algorithm 


21.1 For select eigenvalues A of a matrix A corresponding Schur decompositions 
can be obtained in the form 


* | 


QAQ=T= : 


The last row of this decomposition states e/,,Q’AQ = Ae},,, so that for x = Qe, #0 
x A=Ax', x #0. (21.1) 


Such an x is called a left eigenvector of A corresponding to the eigenvalue A; (A, x) 
is called a left eigenpair. 


Remark. Adjunction takes (21.1) to the equivalent form A’x = Ax. It therefore holds that 


o(A’) = 0(A) and the left eigenvectors of A correspond with the eigenvectors of A’. Hence, 
all of the concepts from this chapter carry over from eigenpairs to left eigenpairs. 


21.2 Starting with Ao = A, we now aim at iteratively computing just the last line 
of a Schur decomposition by constructing a sequence 


* | 


Ag+ = Q) A, Qk — Xd (k — co) (21.2) 
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where all the base changes Q; are meant to be unitary. Notably then, both A; and 
A are unitarily similar which implies that spectrum and norm remain invariant: 


o(Ax) =o(A), — ||Agll2 = IAll2- 


The last line of (21.2) amounts to saying that the last unit basis vector ej, asymp- 
totically becomes a left eigenvector of Ax corresponding to the eigenvalue A: 


eA Ae, (ko). 


21.3 The partitioning 


k 2k 
Ap = 

/ 

ry. Ax 


shows the Rayleigh quotient for the approximate left eigenvector e, of A; to be 
given by 
Ax = Ch, Am 


and the corresponding residual to be of the form 
Chum Me = (| Ole 


The backward error of the perturbed left eigenpair (Ax, em) is thus wz = ||r¢||2/||All2 
and it vanishes in the limit if and only if ||r;||2 — 0 as required in (21.2). 


21.4 Let us construct A,;1 from A; by improving e, as an approximate left 
eigenvector with one step of inverse iteration (with a yet unspecified shift u;): 


(1) solve w;.(Ag — Mel) = ein; 

(2) normalize vz = w/||wxll2; 

(3) extend to an orthonormal basis, i.e., construct a unitary Q, = ( * | 2K) F 

(4) change basis according to Ay. = Q)A¢Qy, such that v, becomes en = Q)0q. 


21.5 The crucial point is now that we attain steps (1)—(3) in one fell swoop by using 
a normalized QR decomposition for the solution of (1):78 


* * 
Ag — Hel = QkRx, Rp = ij 
Pk 


78We follow G. W. Stewart: Afternotes goes to Graduate School, SIAM, Philadelphia, 1997, pp. 137-139. 
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where Q; is unitary and R; is upper triangular with a positive diagonal. From the 
last row of Rx, namely e,,Ry = Prey, We obtain the last row of Q), in the form 


atl y —1 i" -1 
CmQ = = Cm Re(Ak — el) = Pk + @m( Ak — Bel), 
SY’ Ne ee 

normalized vector >0 =U, 


that is, 

Cn Qk = Or 
which means that the last column of Q; is already the vector vg. In addition, 
step (4) can be briefly written as: 


Agr = QARQK = Q(Ak — MeL) Qe + Mel = QYQRKQK + Hel = ReQe + pel. 


21.6 We have thus reached one of the most remarkable, important and elegant 
algorithms of the 20th century, namely the QR algorithm, independently”’ de- 
veloped from John Francis (1959) and Vera Kublanovskaya (1960): for Ag = A,a 
sequence of suitable shifts #, and k = 0,1,2,... the algorithm proceeds by 


Ax — Url = Qy Re (QR factorization), 


Agay = Rp Qa + pel (matrix product RQ). 


For constant shift u, = p this iteration is, by construction, nothing else than inverse 
iteration with continued unitary basis changes that are keeping the left eigenvector 
approximation fixed at the last unit vector ej. In this case, the convergence theory 
from Theorem 20.8 is directly applicable. 


21.7 Although the sequence of the A, is already converging itself, at least under 
proper assumptions, towards a Schur decomposition (in fact even without shifts, 
see §F.2), it is nevertheless more efficient—and conceptually also clearer for the 
selection of suitable shifts 4,—to split off sufficiently accurate eigenvalues row-wise 
from bottom to top. The QR algorithm is therefore stopped when, partitioning 


By | Wy 
oe ri, | An 
as in §21.3, there is backward stability of the left eigenpair (Ay, em): 
I|rnll2 = O(IAll2 - €mach): 


At this point, we perform a numerical deflation in such a way that 


Wn 


— Ay En, En = em (1, | 0) , 


7G. Golub, F. Uhlig: The QR algorithm: 50 years later, IMA J. Numer. Anal. 29, 467-485, 2009. 
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ie., we simply make e an exact left eigenvector of a slightly perturbed matrix Ay. 
By construction, this perturbation fulfills 


|[En|l2 = |lemll2 > [Irn|l2 = O(||All2 - mach): 


From 
An = Oo 4 i QoA Qo-+* Qn-1 = U,AUn, 
e4~ -_—_- 
=U), 


the unitarity of U, and the unitary invariance®® of the spectral norm it follows that 
An =U,AUn, A=A+ E,, — [Ell = O(||All2-€macn)- 


UnEn Uh 


In this way, (An, Unem) is an exact left eigenpair of the perturbed matrix A and 
therefore a backward stable left eigenpair of A. 


Program 15 (QR Algorithm). 


function [lambda,U,B,w] = QR_algorithm(A) 


3 [m,~] = size(A); I = eye(m); U = I; %, initialization 
normA = norm(A,’fro’); % store ||Allr 
5 while norm(A(m,1:m-1)) > eps*normA % backward error still too large? 
Hil = ooo % select shift pm, (§§21.10/21.12) 
[Q,R] = qr(A-mu*TI) ; % QR decomposition 
A = R*Q+mu*I; % RQ multiplication 
U = U*Q; % transformation matrix Up = Q):::Q 
end 
lambda = A(m,m); % eigenvalue 
2B = A(1i:m-1,1:m-1); %, deflated matrix 
3 w = A(i:m-i,m); % last column in Schur decomposition 


21.8 Similarly, we notice that every backward stable left eigenpair of the deflated 
matrix By, belongs to a backward stable left eigenpair of A. If we now apply 
the QR algorithm to the matrix By, which is one dimension smaller, followed 
by numerical deflation, and repeat this process over and over again, we obtain, 
column by column, a backward stable Schur decomposition of A (cf. §18.5). By 
suppressing indices and skipping backward stable perturbations for simplicity, 
the structure of one step of this process can be written as follows: if 


B|W 


QAQ= T 


8°Once more we see the fundamental importance of unitary transformations in numerical analysis: 
perturbations do not get amplified. 


90 Eigenvalue Problems [Ch. V 


is already calculated with Q unitary and T upper triangular, the application of 
the QR algorithm on B with subsequent deflation provides 


B| w 
UB = 
A 


Combined, these two sub-steps result in the same structure as we started with, 


u! U | u'BU | U'W 
A = 
7 | 24k I | T 
Blw - 
U'w B| x 
= A = A * — = r 
T 
T T 


but with the dimension of the new triangular matrix T in the lower right block 
being increased by one and the eigenvalue A being added to its diagonal. 


21.9 In MATLAB, this process of computing a Schur decomposition Q’AQ = T 
can be executed in situ as follows: 


Program 16 (Schur Decomposition). 


object MATLAB for column k 


W T(1:k,k+1:m) 
w T(1:k-1,k) 
A T(k,k) 


T = zeros(m); Q = eye(m); % initialization 

for k=m:-1i:1 % column-wise from back to front 
[lambda ,U,A,w] = QR_algorithm(A); % deflation with the QR algorithm 
Alcs alerts) = Ws pg ileik) ee % transform first k columns of Q 
T(1i:k,k+i:m) = U’*T(1:k,k+1:m); % transform first k remaining rows of T 
T(i:k,k) = [w;lambda]; % new kth column of T 


end 


Here, QR_algorithm(A) is the function from Program 15. 
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Shift Strategies and Convergence Results 


21.10 A look at §21.3 suggests a candidate of the shift jy, in step Ag +> Ag4, of 
the QR algorithm: the Rayleigh quotient 


/ 
He= CmAkem = Ak- 


We refer to this as Rayleigh shift. It specifies line 6 in Program 15 to complete as: 


> mu = A(m,m); 


Using the notation from §21.3, the following convergence of the residual can be 
shown (for a proof see §E.7-E.13): if the initial residual ||ro||2 is sufficiently small, 
it holds 

O(|lr«1I3) in general, 


IIrk+1ll2 = 
O(|\r«||3) for normal matrices. 


Such a convergence is called locally quadratic and locally cubic respectively.*! 


Exercise. Consider an iteration with quadratic (cubic) convergence of the backward error 
being applied to a well-conditioned problem. Show that the number of correct digits will, 
eventually, at least double (triple) in every step. 


21.11 Since for real matrices A the QR algorithm with Rayleigh shift generates a 
sequence A; of real matrices, complex eigenvalues cannot be directly approximated. 
Other than in (21.2), the sequence A; would then converge to a limit of the form 


* * 


a p 
y A 
According to Theorem 20.8, such a convergence does not only occur when the 


lower right 2 x 2-matrix has conjugate-complex eigenvalues but also when both 
of its eigenvalues are symmetrical to the Rayleigh shift » = A. 


Example. Because of RQ = A, the permutation matrix 


with the eigenvalues A = +1 is a fixed point of the QR algorithm with shift = 0. 


81“Iocal”: for starting values in the vicinity of a limit; “global”: for all admissible starting values 
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21.12 In 1968, J. Wilkinson proposed the following shift strategy to address this 
problem: given 


AS a (21.3) 
Ye Ak 


one selects as shift 4 the eigenvalue of the 2 x 2 matrix®* 


( a Bk 
Ve Ak 


that is closest to the Rayleigh shift A;; we refer to this as the Wilkinson shift. In 
Program 15 line 6 is therefore completed as: 


5 mu = eig(A(m-1:m,m-i:m)); [~,ind] = min(abs(mu-A(m,m))); mu = mu(ind); 


As with the Rayleigh shift, the convergence is locally quadratic in general or, 
for normal matrices, locally cubic (for a proof see §E.14). In practice, one needs 
approximately 2-4 iterations per eigenvalue on average.®8 

Exercise. (a) Find how many iterative steps of the QR algorithm Program 16 needs on 


average per eigenvalue for a random 100 x 100 matrix. (b) Show that for a real self-adjoint 
matrix A, the Wilkinson shift is always real. 


Remark. Unfortunately, even the Wilkinson shift is not completely foolproof—except for the 
extremely important case of real symmetric tridiagonal matrices (cf. Phase 2 in §21.14), 
where global convergence can actually be proven.®4 For instance, the permutation matrix 


0 0 1 
A={1 0 0 
01 0 


is a fixed point of the QR algorithm with Wilkinson shift (which is 1 = 0). Therefore, several 
more fine-tuned shift strategies have been developed such as multi-shifts, exception-shifts, 
etc.. The search for a shift strategy that can be proven to converge for every matrix is still 
an open mathematical problem. 


Cost Reduction 


21.13 Even though only a limited number O(1) of iterations per eigenvalue is 
needed, the QR algorithm would still be too costly for the calculation of a Schur 


®2Figenvalues of 2 x 2 matrices are computed as the solution of a quadratic equation; we refer to 
§§14.1-14.2 on a discussion of the numerical stability thereof. 

83 As a bonus, eigenvalues calculated later in the process benefit from the steps of the QR algorithm 
that have addressed preceding ones. 

847. Wilkinson: Global convergence of tridiagonal QR algorithm with origin shifts, Linear Algebra and 
Appl. 1, 409-420, 1968; a more structural proof can be found in W. Hoffmann, B. N. Parlett: A new 
proof of global convergence for the tridiagonal QL algorithm, SLAM J. Numer. Anal. 15, 929-937, 1978. 
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decomposition if one applied it to A without any preprocessing: 


#flop for deflation = # iterations - # flop for one QR step = O(m?) 
SSeS i" 


Program 15 =O(1) Om) 
total cost == O(m? + (m—1)?+---+1) = O(m'). 
— 


# flop Program 16 


The goal of the rest of this section is to reduce the cost from O(m*) to just O(m?) 
(which is then comparable to the cost of a QR decomposition). 


21.14 To achieve this goal, the calculation of a Schur decomposition is broken 
down into two phases: 


(1) unitary reduction of A with a direct O(m*) algorithm to some “simple” form 
H=Q'AQEH: 
(2) use of the QR algorithm on H € H. 


To this end, the space H of “simple” matrices should be invariant under a QR 
step, that is 

H>HA-pl=QR => H,=ROQ+PIEH. 
Here, computing the QR decomposition and the product RQ should both only 
require O(m7) flop. Recalling §9.11, the upper Hessenberg matrices invite themselves 
to form such a space H: for them the computational cost of a QR step is = 6m. 


21.15 The invariance of H under a QR step of phase (2) follows directly from: 


Lemma. Let H and R denote the matrix spaces of upper Hessenberg and upper trian- 
gular matrices respectively. It then holds 


H-RCH, R-HCH. 
If H = QR is the QR decomposition of H € H as in §9.11, then it holds Q € H. 
Proof. According to §§5.2 and 9.11, the matrix spaces ® and H can be characterized 
with the help of the canonical subspaces V;, = span{e1,...,e,} C IK” by 
RER&RYyCKh} (k=1:m), HEH & Hy CV (kK=1:m-—1). 
Step 1. For R € R and H € H we have HR, RH € H, since 
HR(V) C AV, C Via, RH(Vx) C RViaa C Vest (k=1:m-—1). 


Step 2. If H = QR is the QR decomposition of an invertible H € H, it follows from 
the group structure of R with respect to multiplication (Lemma 5.3) that RT! € R 
and therefore Q = HR™! € H. Since, as shown in §9.11, QR decomposition can 
be made continuous in H, it follows that Q € H also for a general H € H by 
approximating H, — H with a sequence of invertible Hy € H. 
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21.16 As for phase (1), the column-wise unitary reduction of A to an upper 
Hessenberg matrix H is best described by looking at the first column: from 


a|y’ 
A — 
x |B 
using a full QR decomposition x = GUe, of x € C'"—))*1, we get 
a | yU 
1| 1| ¢ 
A = 0 
| u’ (u . | uBU 
=Q =Q1 0 


If we execute the same action on U’BU and repeat the process, we thus finally 
obtain the desired Hessenberg matrix H, column-wise from left to right. 
Exercise. Show that the computational cost of this algorithm is O(m?) if one uses an implicit 


representation of Q (by just storing all factors, e.g., Givens rotations, and not expanding 
their product). 


Exercise. Show that if A is self-adjoint, the preprocessed matrix H is also self-adjoint and 
therefore tridiagonal. What is the cost of the two phases in this case? 


Answer: Phase (1) costs O(m3), whereas Phase (2) costs only O(m7?) flop. 


21.17 MATLAB offers access to LAPACK programs of the algorithms from this 
section (for LAPACK ’GE’ stands for a general Hermitian matrix, ‘SY’ for a real 
symmetrical matrix and ‘HE’ for a complex hermitian matrix; cf. §8.1): 


Problem MATLAB LAPACK 
Hessenberg H = Q’AQ [Q,T] = hess(A) xGEHR/xSYTRD/xHETRD 
Schur T = Q'AQ (Q,T] = schur(A,’complex’) xGEES/xSYEV/xHEEV 
all eigenvalues A;(A) (j = 1: m) lambda = eig(A) xGEEV/xSYEV/xHEEV 


The calculation of just the complete set of eigenvalues is computationally much 
less expensive than a fully fledged Schur decomposition since then there is no 
need to deal with the unitary transformations. 
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Example. Here are a few execution times for random real 2000 x 2000 matrices: 


Command General matrix [s] | Symmetric matrix [s] 
[L,R,p] = lu(A,’vector’); 0.11 0.11 
R = triu(qr(A)); 0.23 0.23 
[Q,R] = qr(A); 0.44 0.44 
H = hess(A); 2.0 0.83 
[Q,H] = hess(A); 2.2 1.0 
lambda = eig(A); 3.8 0.89 
[Q,T] = schur(A,’complex’) ; 13 1.4 


21.18 In the case of normal matrices, the columns of the unitary factor Q of a 
Schur decomposition Q’AQ = T form an orthonormal basis of eigenvectors (see 
§18.6). For general matrices, the MATLAB command [V,D] = eig(A) aims at 
calculating a numerical solution V € C’’*"” of 


AV=VD,  D=diag(Ay,...,Am). 


The columns of V would therefore have to be eigenvectors of A; if A does not, 
however, have a basis of eigenvalues, V must be singular—such a matrix A is 
then called non-diagonalizable. On the computer this manifests itself by returning 
a numerically singular matrix V (cf. §11.10) with 


K2(V) 2 ena 


mach’ 


Remark. Individual select eigenvalues can be best calculated as described in §20.11. 


Exercise. Discuss the result of [V,D] = eig([1 1; 0 1]). 


Exercise. Let T € R”*™ be a symmetric tridiagonal matrix of the form 


a by 


(b; #0). 
bn—1 
bn—1 am 
This exercise develops an efficient algorithm for the calculation of the eigenvalues of T that 
are within a given half open interval [w, B). 
e Show that triangular decomposition without pivoting leads (when feasible) to a 
decomposition of the form T — pl = LDL’, where D = diag(d1,...,dm) is given by 
the recursion 


d, =a -U, dj =a; — pu (j=2:m). (21.4) 
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The indices of inertia of T — pI are then given by 


vi (uw) = #{A € 0(T) 2A 2 pw} = #{d; : dj 2 OF, 

vo(v) = #{A € o(T) :A =p} = #{d; : dj = 0} 
with vo(j1) € {0,1}. Characterize the feasibility of the decomposition in exact arith- 
metic. 


e Show that the calculation of the indices of inertia based on (21.4) is backward stable. 
Using the standard model (12.1), prove that the actually calculated quantities d ; have 
the same signs as the di obtained from a perturbed recursion of the form 


; ; ar 
d, =a -4U, dj =aj—p 7 (j=2:m). 
= 


a 


Find a suitable estimate for the relative error of bj. 


e Show that in machine arithmetic (21.4) makes sense even when dj, = 0 for aj, << m. 
Furthermore, the calculated index v_(j1) would remain invariant (but not v+(y)) if 


one prevented d;, = 0 with a sufficiently small perturbation of the form aj, + €. 
e Show that the count v(«, B) = #(0(T) NM [a, B)) is given by v(a, B) = v_(B) —v_(a). 


e Implement the following bisection algorithm: by means of continued bisection of 
intervals, the starting interval [«, 6) is broken down into sub-intervals [a., 6.) which 
are only however further considered in this process if 


vs = #(0(T) M [as,Bx)) > 0. 
Bisection of a sub-interval is stopped if it is sufficiently accurate: B,. — a, < tol. The 
output of the algorithm is a list of intervals [a,., 8.) with their count of eigenvalues. 
e Estimate the computational cost of the algorithm as a function of m, v(a, 8) und tol. 


e Generalize the algorithm to arbitrary real symmetric matrices A by adding a Phase (1) 
similar to §21.14. Why does the assumption b; 4 0 on T not pose a restriction then? 


Exercise. Let the matrices A,B € R”*™ be symmetric, and B be additionally positive 
definite. Consider the generalized eigenvalue problem of the form 


Ax = ABx, x #0. (21.5) 
The set of all generalized eigenvalues A is labeled o(A, B). 


e Give a formula for the (normwise relative) backward error w of a perturbed eigenpair 
(A, £), ¥ € 0. Here only the matrix A should get perturbed. 


e With respect to perturbations of A measured with the 2-norm, the absolute condition 
number of a simple generalized eigenvalue A (let x be a corresponding eigenvector) 
is given by®5 


x!x 


Kabs(A; A) = x! Bx’ 


85y. Frayssé, V. Toumazou: A note on the normwise perturbation theory for the regular generalized eigenprob- 
lem, Numer. Linear Algebra Appl. 5, 1-10, 1998. 
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From this, deduce the bounds |A| || Al|7! < kabs(A;-A) < ||B~||2 (the lower one only 
in the case of A # 0) and make the following statement more precise: 


A generalized eigenvalue A is ill-conditioned if B is ill-conditioned and |A| is 
relatively large. 


Hint: Without loss of generality we can assume that ||B||) = 1. Why? 


Use w and Kaps(A; A) for an algorithmically simple bound of the absolute forward 
error of A. 


Show that a decomposition of the form B = GG’ transforms the generalized eigen- 
value problem (21.5) to the following equivalent symmetric eigenvalue problem: 


Agz=Az, Ag=G'A(G!/,  z=G'x 40. (21.6) 


Get formulas for such G from both the Cholesky and a Schur decomposition of B. 


Apply the inverse iteration with a shift from §§20.6-20.7 to Ag and transform it 
back so that only the matrices A and B appear, but not the factor G. State a suitable 
convergence result. 


Implement the thus generalized inverse iteration efficiently by modifying Program 14. 
Along with the generalized eigenpair (A, x) the output should include the calculated 
absolute condition number «,p,(A; A) as well as a bound of the absolute forward error 
of the generalized eigenvalue A. 


With the help of the transformation (21.6), write a function eigChol (A,B) and a func- 
tion eigSchur (A,B), which both calculate 7(A, B) based on the MATLAB command 
eig(Ac). How many flop are needed for each of the two variants? 


With the three programs developed in this problem as well as the MATLAB command 
eig(A,B), calculate all the eigenvalues for the following example: 


= {flo 2.0 G07 2.0 4.0 8.03 3.0 8.0 Balls 
= [1.0e-6 0.001 0.002; 0.001 1.000001 2.001; 
0.002 2.001 5.000001]; 


A 
B 


Assess the accuracy of the resulting eigenvalues (stability, number of correct digits) 
and rate the methods. Take the condition number of the problem into account. 


Appendix 


A MATLAB: A Very Short Introduction 


MATLAB (MATrix LABoratory) is proprietary commercial software that is used 
in both industry and academia for numerical simulations, data acquisition and 
data analysis. It offers an elegant interface to matrix based numerical calculations, 
as well as state of the art optimized BLAS libraries (Basic Linear Algebra Sub- 
programs) from processor manufacturers and the high-performance Fortran library 
LAPACK for systems of linear equations, matrix decompositions and eigenvalue 
problems. This all is accessible by means of a simple scripting language. 


General Commands 


A.| Help: 


help command, doc command 


shows the help text of a command in the console or browser. 
A.2 ’,’ and ’;” separate commands, whereby ’;’ suppresses displaying the output. 


A.3 Information: 


whos 


provides information about the variables in memory. 


A.4 Measuring execution time: 


tic, statements, toc 


executes the statements and measures the computation time required. 


A.5 Comments: 


% comments can be written behind a percent sign 
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Matrices 


A.6 MATLAB identifies scalars with 1 x 1 matrices, and column vectors (row 
vectors) of dimension m with m x 1- (1 x m-) matrices. 


A.7 Assignment: 
A = expression; 


assigns the variable A the value of expression. 


A.8 +,-,* are matrix addition, subtraction, multiplication. A/B computes the 
solution X to the linear system A = XB and A\B the solution X to AX = B. 


A.9 .* is the componentwise multiplication of two matrices. 
A.10 A? is the adjoint matrix of A. 


A.| 1 Matrix input: 


A = [ 0.32975 -0.77335 0.12728; 
-0.17728 -0.73666 0.45504]; 


or 
A = [ 0.32975 -0.77335 0.12728; -0.17728 -0.73666 0.45504]; 


assigns the following matrix to the variable A: 


0.32975 —0.77335 0.12728 
—0.17728 —0.73666 0.45504) ° 


A.12 Identity matrix: 
eye (m) 


generates the identity matrix of dimension m. 


A.13 Zero matrix: 
zeros(m,n), zeros(m) 


generates a matrix of zeros with the dimension m x n and m x m respectively. 


A.14 Ones matrix: 
ones(m,n), ones(m) 


generates a matrix of ones with the dimension m x n and m x m respectively. 
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A.I5 Random matrix with with uniformly distributed entries: 
rand(m,n), rand(m) 


generates a matrix of i.i.d. random entries, uniformly distributed on [0,1], with 
the dimension m x n and m x m respectively. 


A.|16 Random matrices with normally distributed entries: 
randn(m,n), randn(m) 


generates a matrix of i.i.d. random entries, having standard normal distribution, 
with the dimension m x n and m x m respectively. 


A.17 Index vectors: 
H}Slss JP REshs 
are the row vectors [j,j+1,...,k-1,k] and [j,j+s,..., j+m*s] with 


m = |(k—j)/s]. 


A.18 Components: 
AQj,k) 


is the element of the matrix A in the jth row and kth column. 


A.19 Submatrices: 
A(j1:j2,k1:k2) 


is the submatrix of the matrix A, in which the row indices are j1 to j2 and the 
column indices are ki to k2. 


A(j1:j2,:), AC:,k1:k2) 


are the submatrices built from rows j1 to j2 and from the columns k1 to k2. 


A.20 Matrix as vector: 
v = AC:) 


stacks the columns of A € KK”%*” consecutively on top of one another to make a 
column vector v € K””. 


A.21 Closing index: 

x(end) 

is the last component of the vector x and 
A(end,:), A(:,end) 


is the last row and column respectively of the matrix A. 
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A.22 Triangular matrices: 
tril(A), triu(A) 


extracts the lower and upper triangular part of the matrix A and fills up with zeros. 


A.23 Dimension of a matrix: 

size(A,1) bzw. size(A,2) 

outputs the number of rows and columns of the matrix A respectively; 
size(A) 


returns the row vector [number of rows,number of columns]. 


A.24 Dimension of a vector: 
length (x) 


returns the dimension of the column or row vector x. 


Functions 


A.25 The definition of a MATLAB function myfunc begins with the line 
function [o_1,0_2,...,o_n] = myfunc(i_1,i_2,...,i_m) 


where i_1,i_2,...,i_m represent the input variables and 0_1,0_2,...,o_n rep- 
resent the output variables. The function has to be saved as the file myfunc.m in 
the project directory. The function is called in the command line (or in further 
function definitions) by: 


KoRiom2 rr, Onn! es my-funic| Gide Dhyne y em) 


Output variables can be ignored by omitting the last variables or by replacing 
them with the place holder ’~’: 
[~,o_2] = myfunc(i_1,i_2,...,i_m) 


If there is only one output variable, one can omit the square brackets: 


o_1 = myfunc(i_1,i_2,...,i_m) 


A.26 Function handle: 


f = @myfunc; 
LO 6O Boo oo psa] Se Glas 8), 5 5 a gat_in)) 


A function handle can be created with the ’@’ symbol and assigned to a variable 
that then acts like the function itself. 


Sec. A] MATLAB: A Very Short Introduction 103 


A.27 Anonymous function: 
f = @(i_1,i_2,...,i_m) ezpression; 


generates a handle for a function that assigns the value of the expression to 
i_1,i_2,...,i_m. The call is executed by 


a SS eG SL yh Fs 55 pak i) 


Control Flow 


A.28 Conditional branches: 


if expression 
instructions 
elseif expression 
instructions 
else 
instructions 
end 


The instructions are executed if the real part of all entries of the (matrix valued) 
expression is non-zero; ‘elseif’ and else’ are optional and multiple ‘elseif’ 
can be used. 


A.29 for loops: 


for Variable = vector 
instructions 
end 


executes instructions multiple times where the variable is successively as- 
signed the values of the components of vector. 


A.30 while loops: 


while expression 
instructions 
end 


executes instructions repeatedly as long as the real part of all entries of the 
(matrix valued) expressions is non-zero. 


A.3| ‘break’ stops the execution of a for or a while loop. 


A.32 ‘continue’ jumps to the next instance of a for or a while loop. 
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A.33 ‘return’ leaves the current function by returning control to the invoking 
function. 


Logic Functions 
A.34 Comparisons: 
A == B, A -= B, A <= B, A>=B, A<B, A>B 


componentwise comparisons of matrices; 1 stands for ‘true’ and 0 for ‘false’. 


A.35 Test for “all” or “at least one” true element: 
all(A < B), any(A < B) 


tests whether all comparisons, or at least one, is true; equivalently valid for the 
other operators ==, ~=, <=, >=, >. 


A.36 Logic Boolean operations: 


a && b, a || b 


are the logical ‘and’ and ‘or’ for scalar values a and b. Every non-zero value is 
understood as true, and the value zero as ‘false’; the result is 0 or 1. 


A.37 Negation: 
~expression, not (expression) 


both return the logical negation of expression. 


A.38 Test for an empty matrix: 
isempty (A) 


is 1 if the matrix A is empty, otherwise 0. 


Componentwise Operations 


A.39 Componentwise (“point-wise”) multiplication, division and power: 


A.*B, A./B, A.~2, A.7B 


A.40 Componentwise functions: 


abs(A), sqrt(A), sin(A), cos(A) 
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B Julia: A Modern Alternative to MATLAB 


B.I Since 2012, the highly performant, expressive and dynamical programming 
language Julia (open source with the MIT/GPL license) has been developed by 
a group lead by the mathematician Alan Edelman at MIT. Julia, by means of its 
modern design, is well suited for both the requirements of scientific computing, 
as well as more generalized programming tasks and includes elements such as 


e JIT-compilation using LLVM,°6 

e parametric polymorphism and type inference at compile-time, 
e object oriented programming based on multiple dispatch, 

e homoiconicity and ,hygienic’ macro programming, 

e the ability to directly call C- and Fortran libraries. 


This way Julia combines the expressiveness of programming languages like MAT- 
LAB and Python with the performance of programming languages like C, C++ 
and Fortran. For a more complete account and introduction, I strongly recommend 
reading the following paper: 


J. Bezanson, A. Edelman, S. Karpinski, V. B. Shah: Julia: A Fresh Ap- 
proach to Numerical Computing, SLAM Review 59, 65-98, 2017. 


An international and extremely active developer community has augmented 
the functionality of Julia by more than 1500 packages, pushing Julia into the 
ranks of one of the fastest growing programming languages ever. By means of 
its modern concept, high performance, and free accessibility, Julia has already 
become a highly attractive alternative to MATLAB and may very well overtake in 
the university setting in the near future. 

In order to aid the commendable transition, I have listed all MATLAB programs 
of this book below in Julia 0.6 and will go into the differences and advantages for 
each of these examples.®” 


B.2 Julia’s language elements for numerical linear algebra are basically very 
closely related to MATLAB. I refer you to the 650-pages reference manual and the 
help function for details: 


? command 


86“T ow Level Virtual Machine”, a modular compiler infrastructure architecture that is also the basis 
of the highly optimized C compiler Clang. 

87In order to easily test Julia, the Julia environment juliabox.com allows one to use Julia directly in a 
browser without local installation. 
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One small syntactical difference is the use of square brackets instead of round 
brackets for matrix indices: 


MC atals Allee ntalsrei, AEs. sl, Ailes sretere), Ale 


B.3 In contrast to MATLAB, vectors and column vectors (as well as 1 x 1 matri- 
ces and scalars) are not identified in Julia, but rather treated as fundamentally 
different types (type differences lead to distinctively compiled programs and a 
great increase in Julia’s execution speed, since types no longer must be identified 
at run-time). When working with column vectors, this phenomenon is barely no- 
ticeable, since matrix-vector products are compatible with matrix-column-vector 
products. When working with row vectors however, one must take care, since both 
of the expressions 


UNIES sts] 4 AAIES) 6 2d] 
create a vector from the corresponding components of the columns and rows of 


the matrix A. Explicit generation of a column or row vector requires the following 
(MATLAB compatible) syntax: 


IMLS otesell, ANILG Sa} el 


Hence, the table from §3.2 changes in Julia as follows: 


meaning formula Julia 

components of x Ck x [k] 

components of A Mik ACj,k] 

column vectors of A ak AL:,k] and AL: ,k:k] 

row vectors of A a Alj:j.:] 

submatrix of A (ik) j=m:p,k=n:l Alm:p,n:1] 

adjunct matrix of AA’ Ww? 

matrix product AB A*B 

identity matrix Te Rm", cmxm eye(m), complex (eye (m) ) 
zero matrix 0€R™", cunt zeros(m,n), complex(...) 


B.4 One significant semantic difference between MATLAB and Julia is the passing 
of matrices for assignments or function calls. Julia actually passes references to an 
object (“call by reference”) so that in the following example, the variables A and B 
both refer to the same matrix in memory: 


Se No [il Me @ Alls 
>> B= A; 


me JS (il, ill] = sale 


>> A1i,1] 


5-1 
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In comparison, MATLAB passes copies of the values (“call by value”) and, for this 
example, would return the original value of 1 for A(1,1). In Julia, this behavior 
can be achieved by creating an explicit copy: 


B = copy (A) 


B.5 As with MATLAB, Julia also creates a copy for “slices” such as A[: ,k:k] (so 
here a copy of a m x 1 submatrix), but also offers direct views of A in memory with 
the commands view(A,:,k:k) foram X 1 dimensional matrix and view(A,:,k) 
for an m dimensional vector. Alternately, one can use the @view macro: 


x = A[:,k] # the vector x is a copy of the kth colunn of A 
y = @view A[:,k] # the vector y is identical with the kth column of A in memory 


These views do not only save space in memory, but often also execution time by 
avoiding unnecessary copy operations. 


B.6 Matrix multiplication 
Remark. For K = C, the first line of code must be C = complex(zeros(m,p)). 


Program | (Matrix Product: column-wise). 


C = zeros(m,p) 
t@se JLab B79) 

@(le, i) = Ae [Pe .1l]] 
end 


Program 2 (Matrix Product: row-wise). 


C = zeros(m,p) 


2 Ore Seem 


Cli saesd S Al So 5 sil eas 
end 


Since Julia calls the BLAS routine xGEMM for matrix-matrix multiplication to 
execute the matrix-column-vector product in line 3, this program be accelerated 
through the explicit use of the xGEMV BLAS routine (notice that the order of factors 
requires the transposition flag ’T’ to be set):°§ 


Program 2 (Matrix Product: row-wise, accelerated version). 


C = zeros(m,p) 


2 Orme! 


Clj,:] = BLAS.gemv(’T’,B,A[j,:]) 
end 


88Julia offers a very user friendly interface to BLAS and LAPACK where a routine named ‘xyz’ can be 
called by BLAS.xyz and LAPACK. xyz. 


nN 


nN 


Nv 


Nv 
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Program 3 (Matrix Product: inner products). 


C = zeros(m,p) 
ne@e 9) Sab gam 
nieye Alals ye) 
Gees ockll S ANG Aa p BAECS, aly) 
end 
end 


This version can be considerably accelerated when Julia explicitly uses a BLAS-1 
routine and does not create a copy of the row and column vectors: 


Program 3 (Matrix Product: inner product, accelerated version). 


C = zeros(m,p) 
see ibaa 
ioe dhsil Sys) 
C[j,1] = dot(conj(view(A,j,:)),view(B,:,1)) 
end 
end 


Remark. Notice that the dot command takes complex conjugates of the first factor. 
Program 4 (Matrix Product: outer product). 


C = zeros(m,p) 
for k=1:n 

C += AC: ,k)*B(k:k,:] 
end 


Since Julia calls a matrix-matrix multiplication for computing the outer product, 
the program can be accelerated with the suitable Level-2 BLAS routine xGER: 


Program 4 (Matrix Product: outer products, accelerated version for real matrices). 


C = zeros(m,p) 
for k=1:n 

BLAS. ger!(1.0,A(:,k],B[k,:],C) # in situ call to xGER 
end 


Remark. Julia uses an ’!’ at the end of a function name when the routine runs in situ and 
thereby at least one argument is overwritten. In this way, BLAS. ger! (alpha,x,y,A) executes 
the rank-1 update A + axy! — A right in place where A is stored in memory. 


Exercise. Modify this program so that it works correctly for complex matrices, too. 
Program 5 (Matrix Product: componentwise). 


C = zeros(m,p) 
ose WIL Bin 
gee dail 249) 
for k=i:n 
CCj,1] += ALj,k]*B(k,1] 
end 
end 
end 
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Thanks to the extremely effective JIT-compiler (that creates native machine code 
directly before the first execution of the program), componentwise multiplication 
is much faster in Julia than in MATLAB, and is almost as fast as in C. The full 
performance of C can be achieved by using the compiler directive @inbounds, 
which tells Julia that the indices will stay in bounds so that it can safely pass on 
admissibility checks: 


Program 5 (Matrix product: componentwise, accelerated version). 


C = zeros(m,p) 


2 for j=i:m 


se@ue” desl 2 js) 
for k=i:n 
@inbounds C[j,1] += ALj,k]*B[k,1] 
end 
end 
end 


The execution times show that Julia is especially faster than MATLAB when 
loops are involved, in which case the full performance of a compiled language 
can be exploited. Below, Table 3.4 has been amended: 


program BLAS level MATLAB [s] C& BLAS [s] Julia [s] 
A*B 3 0.031 0.029 0.030 
column-wise 2 0.47 0.45 0.48 
row-wise 2 0.52 0.49 1.2 | 0.52 
outer product 2 3.5 0.75 6.53 | 0.75 
inner product 1 17 15 8.5 | 1.8 
componentwise 0 20 1.6 3.1 | 1.6 


The minimal discrepancy between C and Julia can be accounted for by the use 
of different BLAS libraries: MKL from Intel is used in the C implementation (as 
well as by MATLAB) and OpenBLAS by Julia. 


B.7 Execution times can either be measured with the @time macro, 


@time instructions 


or similar to MATLAB with 

tic(); instructions; toc() 

More accurate measurements can be taken with the package BenchmarkTools, 
which automatically calculates an average of multiple runs if the execution is fast. 


The number of local processor threads used for a possibly parallel execution of 
BLAS routines can be easily controlled with the following command: 


BLAS.set_num_threads (number of coras) 
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B.8 Forward substitution to solve Lx = b: 
Program 6. 


x = zeros(b) # zero vector of the same size and type (real/complex) as b 


2 for k=1:m 


x[k:k] = (b[k:k] - L[k:k,1:k-1]*x[1:k-1])/LI[k,k] 
end 


Remark. Notice the consistent “double” use of the index k in the form k:k. This is because 
in Julia the product of row and column vectors 


L[k:k,1:k-1]*x[1:k-1] 


returns a result of the type vector (in K'), whereas the components b[k], x[k] are scalars. 
Julia does not provide a “type promotion” functionality that would allow the assignment 
of vectors in K! or matrices in K!*! to scalars IK. We must therefore write b[k:k] and 
x[k:k] in order to ensure the correct type is matched. Alternatively, the dot command can 
be used, but for K = C the complex conjugate in the first factor must be taken into account: 


x = zeros(b) 
for k=i:m 

x(k] = (b[k] - dot(conj(L[k,1:k-1]) ,x[1:k-1]))/L[k,k] 
end 


For in situ execution, the first variant can be written as: 


Program 7 (Forward Substitution for x < L~!x). 


for k=1:m 
x(k:k] -= L[k:k,1:k-1]*x[1:k-1] 
x[k] /= LI[k,k] 

end 


In Julia, the command to solve a triangular system of equations such as Lx = b 
or Ux = b can also be written succinctly (and exactly as in MATLAB, Julia thereby 
analyzes the structure of the matrix): 


x = LNb, x = UND 


B.9 If we encode a permutation 7 € S,, in Julia by p = [7(1),...,7(m)], row and 
column permutations P/.A and AP, can be expressed as: 


Mos 8ll5 AILS ajol 

The permutation matrix P, itself can be obtained as follows: 

E = eye(m), P = E[:,p] 

In Julia, the symbol I is reserved for a universal identity matrix (see the reconstruc- 


tion of the L-factor in §B.10 and the program §B.23) and should therefore never 
be overwritten. 
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B.10 In situ triangular decomposition without pivoting can be written as: 


Program 8 (Triangular Decomposition). 


for k=1:m 
ACk+1i:m,k] /= A[k,k] ia (foals) 
ACk+1:m,k+1i:m]) -= A[k+1:m,k]*A[k:k,k+1i:m] ca (if oile) 
end 


The factors L and U are then reconstructed with the commands: 


5 i= tri sais 7 


U = triu(A) 


Here, copies of the data from matrix A are created and stored as full matrices (with 
the other triangle full of zeros). It is more effective to use the direct view of the 
memory of A in the form of a data structure for (unipotent) triangular matrices 
that can then nevertheless be used like standard matrices for further calculations: 


5 L = LinAlg.UnitLowerTriangular (A) 


Now 


U = LinAlg.UpperTriangular (A) 


When pivoting is added, the code can be written as follows: 


Program 9 (Triangular Decomposition with Partial Pivoting). 


p = collect (1:m) # initialization of the permutation vector 

form i= lesan! 
j = indmax(abs(A[k:m,k])); j = k-1+j # pivot search 
p((k,j]] = p(lj,k]]; ACCk,j],:] = ACCJ,k],:] # row swap 
A[k+i:m,k] /= A[k,k] # (7.2d) 
ACk+1:m,k+1i:m] -= A[k+1:m,k]*A[k:k,k+1i:m] ca (Ufo2ey) 

end 


The triangular factors of the matrix A[p,:] are exactly reconstructed as above. 
Peak performance can be reached by means of the Julia interface to xGETRF: 


le, Wy jo = Wad) 


B.I1 In Julia, the solution X € K"”*" of the system AX = B € K"*" is therefore: 


li, Wy js S TIGA) 
A = i Ws [py 9 1 


3 Xe meUINZ 


or completely equivalently, if the decomposition P’A = LU is no longer needed: 
X = A\B 

Alternatively, Julia offers the compact data structure Base .LinAlg.LU to store the 
factorization: 


= lufact(A) # lufact(A,Val{false}) prevents pivoting 


F 
L = F(:L]; U = F0:U]; p = F[:p] 
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This way a system of the form AX = B can be solved without re-factoring of the 
matrix A brief and succinctly as follows: 


X = F\B 


B.12 In Julia, the program for the Cholesky decomposition can be written as: 
Program 10 (Cholesky Decomposition of A). 


L = zeros(A) 


2 for k=1:m 
ik = Li kai tek =t) SA Lek =1, 8) # (8.1b) 
Lik, tc k=i) = 1k? # (8.1a) 
L(Uk:k,k] = sqrt(A[k:k,k] - 1k’*1lk) # (8.1c) 
end 


Exercise. Write an in situ version that converts the lower triangular matrix of A into the 
lower triangular matrix of L. Hint: Use LinAlg.LowerTriangular, cf. §B.10. 


Peak performance can be reached with the interface to xPOTRF: 
L = chol(A,Val{:L}) 
The factor U = L’ is calculated slightly faster if a column-wise storage scheme is 
used for matrices on the machine at hand: 
U = chol(A) 
If a linear system of the form AX = B is to be solved, the following approach 
should be taken (this way the factorization can also be easily re-used):*? 


F = cholfact (A) 
X = F\B # solution of the system of equations 
L, U = F[E:L], F[:U] # L factor and, alternatively, U=L’ factor 


B.13 The MGS algorithm of QR decomposition can be written in situ as follows: 
Program |! (QR Decomposition with MGS Algorithm). 


R = zeros(n,n) # for K=C: R = complex(zeros(n,n)) 


2 for k=1:n 


R(k,k] = norm(A[:,k]) 

A[:,k] /= R[k,k] 

R(k:k,k+i:n]) = A[:,k]’*A[:,k+1:n] 

AL:,k+1:n] -= A[:,k]*R[k:k,k+1:n] 
end 


The calculation of an (often not normalized) QR decomposition with the House- 
holder method can be compactly completed with the following LAPACK interface: 


Q, R = qr(A, thin=true) 


8°x = A\B actually does not use the Cholesky decomposition of A, see the remark in §B.19. 
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Unlike MATLAB, Julia allows access to (and convenient use of) the implicit 
representation” of the matrix Q as mentioned in §9.8: 


F = qrfact(A) # faster by a factor of 2 


> Q = F[:Q] # of the type Base.LinAlg.QRCompactWYQ 


R = F[:R] # read the factor R 


The object oriented nature of Julia allows for matrix-vector multiplication Q*x etc. 
even when Q is of the type Base .LinAlg.QRCompactWYQ , the details are completely 
hidden from the user. The unitary factor Q can be retroactively represented as an 
ordinary matrix if needed: 


5 Q = full(F[:Q], thin=true) # needs more cpu time than qrfact(A) itself 


The sum of the execution times from lines 2 and 5 is exactly equal to that of the 
command from line 1. 


B.14 Error Analysis 1 from §§14.1—14.2: quadratic equation 


>> p = 400000.; q = 1.234567890123456; 
22 r= eqrttp 2Fq)) x0 = p = xr 
-1.543201506137848e-6 

Pees ae A) = 0) oP) 


5 1.23455810546875 


1 


m0 


Jeo ceil = jo) cP es 


>> x0 = -q/xl 
-1.5432098626513432e-6 


B.15 Error Analysis 2 from §§14.3-14.4: evaluation of log(1 + x) 


>> x = 1.234567890123456e-10; 
>> w = 1+x; f = log(w) 


3 1.2345680033069662e-10 


>> w-1 


5 1.234568003383174e-10 


5 >> £ = log(w)/(w-1) *x 


1.234567890047248e-10 
>> logip(x) 
1.234567890047248e-10 


B.16 Error Analysis 3 from §§14.5-14.6: sample variance 


>> x = [10000000.0; 10000000.1; 10000000.2]; m = length(x); 
>> 92) = Csum Gc 92) = sums) 5 2)/m))/\ Cm) 


3 -0.03125 


>> xbar = mean(x); 


5 >> S2 = sum((x-xbar) .~2)/(m-1) 


More specifically, both Julia and LAPACK use the compact WY representation of Q, see R. Schreiber, 
C. F. Van Loan: A storage-efficient WY representation for products of Householder transformations. SAM 
J. Sci. Stat. Comput. 10, 53-57, 1989. 
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0.009999999925494194 

>> var (x) 

0.009999999925494194 

>> kappa = 2*dot(abs(x-xbar) ,abs(x))/S2/(m-1) 
2.000000027450581e8 


B.17 Julia offers an interface to LAPACK’s condition number estimate xGECON: 


cond(A,1) # estimate of (A) 
cond(A,Inf) # estimate of Ko(A) 


Here, a previous triangular decomposition of the matrix A can be elegantly and 
profitably re-used: 
F = lufact(A) # triangular decomposition of A 


cond(F,1) # estimate of 1,(A) 
cond (F, Inf) # estimate of Koo(A) 


B.18 The Wilkinson example of an instability of triangular decomposition with 
partial pivoting as discussed in §15.10 can be written in Julia as: 


>> m = 25; 
>> A = 2I-tril(ones(m,m)); AL:,m] = 1; # Wilkinson matrix 
3 >> b = randn(MersenneTwister (123) ,m); # reproducible right hand side 
>> F = lufact(A); # triangular decomposition with partial pivoting 
>> x = F\b; # substitutions 
Pees fe) Te joy INE be D # residual 
>> omega = norm(r,Inf)/(norm(A,Inf)*norm(x,Inf)) # backward error (15.1) 


2.6960071380884023e-11 


Let us compare once more with the result using the QR decomposition: 


>> F_qr = qrfact(A); # QR decomposition 

>> x_qr = F_qr\b; # substitutions 

oe se Ce = |) = MNase Cres # residual 

>> omega_qr = norm(r_qr,Inf)/(norm(A,Inf)*norm(x_qr,Inf)) # back. error 
3. 7.7342267691589ele-17 

>> cond(F, Inf) # estimate of the condition number K.(A) 
5 25.0 


5 => norm(x-x_qr,Inf)/norm(x, Inf) 


4.848325731432268e-10 


This accuracy, however, is consistent with the forward error estimate (15.3): 
Koo(A)w(%) 25 x 2.7-107! = 6.8-107). 


As described in §15.13, just one step of iterative refinement repairs the instability 
of the triangular decomposition with partial pivoting. Notice that the previously 
calculated triangular decomposition F can elegantly be re-used: 
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9 >> w = F\r; 


eee be Ee be Ge ye oe ES joy = J \esbed 
>> omega = norm(r,Inf)/(norm(A, Inf) *norm(x,Inf)) 
7.031115244689928e-18 


3 ee norm (s-x_qr,int)/norm({x, Int) 


1.582000930055234e-15 


As a comparison: the unavoidable error of the solution is Ko0(A) - €mach © 2° 10-1, 


B.19 The Lauchli example from §16.5 can be written in Julia as: 


SS O = We=%9 ff e=i07 
>> A = [1 1; e 0; 0 e]; b = [2; e; e]; # Lauchli example 

; >> F = cholfact(A’A); # Cholesky decomposition of the Gram matrix 
>> x = F\(A’b); # solution of normal equation 

rao joreating 6 (C"  a52)) 8 


a6 


1.01123595505618 0.9887640449438204 


Remark. In contrast, the direct solution of the normal equation using the \ command, without 
explicitly using Cholesky decomposition, accidentally”! yields the “correct” solution: 
>> x = (A’A)\C(A’D); # solution of the normal equation 


S252 jorpalinig o (YW 456) 8 
1.0000000000000002 1.0 


Hence, a comparison with §16.5 shows that the \ command for self-adjoint matrices behaves 
differently in Julia than in MATLAB: 


e In the case of self-adjoint matrices, Julia always calls the LAPACK routine xSYTRF, 
which uses a symmetrical variant of triangular decomposition with pivoting, namely 
in the form of the Bunch—Kaufman—Parlett decomposition 


P/AP=LDL'. 


Here, the permutation matrix P represents diagonal pivoting, L is lower triangular, 
and D is a block-diagonal matrix comprised of 1 x 1 or 2 x 2 blocks. As it were, this 
method has the same complexity of m? /3 flop as the Cholesky decomposition, but 
for s.p.d. matrices it is less amenable to optimized BLAS and therefore slower due to 
the overhead of pivoting. 


e In contrast, in the case of self-adjoint matrices with positive diagonals, MATLAB 
attempts to see whether a Cholesky decomposition can be successfully calculated 
(thereby checking whether the matrix is s.p.d.); otherwise, as in Julia, xSYTRF is used. 


Since the positive definiteness of a given self-adjoint matrix is most often known based on 
the structure of the underlying problem, and Cholesky decompositions are only used in a 
targeted manner, the approach taken by Julia is more efficient for the rest of the cases (this 
way avoiding wasted trial Cholesky decompositions). 


°\Notice that the information loss in A ++ A’ A is irreparably unstable. Nevertheless, we cannot exclude 
that a particular decomposition method accidentally gives a correct solution in a specific example 
where one would reliably rather require a different, stable algorithmic approach. In the Lauchli 
example, the choice of the smaller value e = 1e-8 would yield a numerically singular matrix AA 
which causes problems no matter what decomposition of A’A is used. 
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B.20 The solution of least squares problems discussed in §17.2 reads in Julia as: 


Program 12 (Q-Free Solution of the Least Squares Problem). 


R qrfact([A b])[:R] # R factor of the matrix A augmented by D 
x = R[i:n,i:n]\R[1i:n,n+1] # solution of Rxr=z 


3 rho = abs(R[nt+1,n+1]) # norm of the residual 


or simply x = A\b. The numerical example from §16.5 is then written as follows: 


Se oe = Vs 
p52 jonealinng 6 (™ Ware) g 
0.9999999999999996 1.0000000000000004 


B.2| The example from §18.3 cannot be directly copied into Julia, since the 
developers of Julia rightly do not see the point in providing unstable and inefficient 
routines to calculate eigenvalues as roots of the characteristic polynomial. We can, 
however, convey the essence of this example with the help of the Polynomials 
package: 

>> using Polynomials 

>> chi = poly(1.0:22.0) # polynomial with the roots 1,2,...,22 

>> lambda = roots(chi) # roots of the monomial representation of x 

>> lambda[7:8] 

2-element Array{Complex{Float64},1}: 


15.5374+1.15386im 
15.5374-1.15386im 


The condition number (absolute with respect to the result and relative with respect 
to the data) can be computed as: 


>> lambda = 15.0 

>> chi_abs = Poly(abs.(coeffs(chi))) 

>> kappa = polyval(chi_abs ,lambda)/abs.(polyval (polyder (chi) , lambda) ) 
5.711809187852336e16 


As in §18.3, we get «(15) - Emach © 6. 


B.22 The power iteration from §20.3 is in Julia: 


Program 13 (Power Iteration). 


normA = vecnorm(A) # store ||Allr 
omega = Inf # initialize the backward error 
3 w = Axv # initialize w= Av 

while omega > tol # backward error still too large? 
v = w/norm(w) # new Vv 
w= Atv # new Ww 
mu = dot(v,w) # Rayleigh quotient 
r= WwW > mu*v # residual 
omega = norm(r)/normA # backward error estimate W 


end 
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B.23 In Julia, the inverse iteration with shift from §20.7 can elegantly be written 
as follows (notice the clever re-use of the triangular decomposition F in line 5): 


Program 14 (Inverse Iteration). 


normA = vecnorm(A) # store ||Allr 
omega = Inf # initialize the backward error 
3 F = lufact(A - muShift*I) # store the triangular decomposition of A— pl 
while omega > tol # backward error still too large? 
w = F\v # solve the linear system of equations 
normW = norm(w) # ||wll2 
z = v/normW # auxiliary quantity Zz 
v = w/normW # new Vv 
rho = dot(v,z) # auxiliary quantity 0 
mu = muShift + rho # Rayleigh quotient 
a Zee OLA, # residual 
omega = norm(r)/normA # backward error estimate Ww 
3 lend 


B.24 In §20.11 we explained that usually just one step of inverse iteration suffices 
to compute an eigenvector corresponding to a given backward stable eigenvalue. 
In Julia, the illustrative example can be written as follows: 


>> m = 1000; rng = MersenneTwister (123) # initializaton 
>> lambda = collect(i:m) + collect(-5:m-6)*1.0im # given eigenvalues 

3 >> Q,= qr(randn(rng,m,m)); A = Q*diagm(lambda)*Q’ # suitable normal matrix 
>> mu = 1.0 - 5.0im # shift=known eigenval. 
>> v = randn(rng,m); v = v/norm(v) # random start vector 
>> w= (A - mu*I)\v # 1 step inv. iteration 
>> omega = sqrt (m)*norm(A*w-mu*w)/vecnorm(A)/norm(w) # Eq. (20.1) 


1.6091332673393636e-15 


B.25 The program from §21.7 for a QR-algorithm based deflation is in Julia: 
Program 15 (QR Algorithm). 


function QR_algorithm(A) 


m, = size(A); U =I initialize 

normA = vecnorm(A) store |/A||r 

while norm(A[m,1:m-1])>eps()*normA backward error still too large? 
mu = shift pz, (see §§21.10/21.12) 


Q, R = qr(A - mu*I) QR decomposition 


# HH H HH H H 


A = R*Q + mux*I RQ multiplication 
U = U*Q transformation U;, = Q):+:Q, 
end 
lambda = Al[m,m] # eigenvalue 
B = A[1:m-1,1:m-1] # deflated matrix 
w = A[1i:m-i,m] # last column in Schur decomposition 


return lambda, U, B, w 
end 
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In program line 5 the Rayleigh shift from §21.10 is simply mu = Alm,m]; whereas 
the Wilkinson shift from §21.12 is coded as: 


5 mu, = eig(A[m-1:m,m-1i:m]); ind = indmin(abs(mu-A[m,m])); mu = mu[ind] 


3 Pork =m Sie 


N 


A Schur decomposition is finally calculated as in §21.9, though in Julia the 
variables T and Q should be initialized as complex matrices: 


Program 16 (Schur Decomposition). 


T = complex (zeros (m,m)) 
Q complex (eye (m) ) 


Inatialazatvonst rcs 


- as complex matrices 
column-wise from back to front 
QR-algorithm based deflation 
transform first k columns of q 
transform first k remaining rows of T 
new kth column in T 


lambda, U, A, w = QR_algorithm(A) 
Ms ptetel = @ [Lede tele 
t[1i:k,k+i:m]) = u’*t[i:k,k+1:m] 
T[1i:k,k] = [w; lambda] 

end 


# # HH HH HH H OH 


B.26 We summarize the Julia commands for the algorithms from Chapter V. 


e Hessenberg decomposition H = Q/AQ: 


1 F = hessfact (A) 
= 19 (Es eh 
= F[:H] 


wo N 
=z oO 


Here, Q has an implicit representation of type Base .LinAlg.HessenbergQ (cf. the first 
exercise in §21.16), the fully fledged explicit matrix can be obtained with 


Q = full(Q) 


e Schur decomposition T = Q'AQ: 


1 T, Q, lambda = schur(complex(A)) # ist version 
2 F = schurfact (complex (A) ) # 2nd version 
3 T = F[:T]; Q = F[:Z]; lambda = F[: values] # extraction of factors 


The diagonal of T, i.e., the eigenvalues of A, can be found in the vector lambda. 


e If one is only interested in calculating the eigenvalues of A, which is less 
costly than a Schur decomposition since then there is no need to deal with 
the unitary transformations, one simply calls the following command: 


lambda = ecigvals(A) 
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C Norms: Recap and Supplement 


C.1 || - || : V — [0,00) is a norm on the vector space V, if for v,w € V,A € K: 
(1) ||/v|| =0<v=0 (definiteness ) 
(2) |[Av|| =|A]- |||] (absolute homogeneity) 
(3) |jv+ wl] < |lo|| + ||w|| (subadditivity or triangle inequality) 

We call norms on K" vector norms and those on K"™*" matrix norms. 


C.2 Numerical analysis uses the following vector norms for x = (¢)) jelim © KK": 


m 
e “taxicab” norm ||x\|1 = Ye ICjlz 
j=l 


mL 1/2 
e Euclidean norm (cf. §2.9)  ||x|]2 = ( ai = Vx'x; 
i=1 


J 
e maximum norm — ||x||oo = maxj=1:m |¢j\- 


Exercise. Draw the “unit sphere” {x : ||x|| <1} for these three norms in R?. 


C.3 Holder's inequality and the Cauchy—Schwarz inequality are valid, i.e., 
Ix'yl < [lx[la-Mlylleo, ey] <I lla-llyll2 = (@y € KK"). 
Besides, the || - ||2-norm is invariant under a column orthonormal Q € K™*": 


|| Qx||3 = (Qx)'(Qx) = x’ O'Qx =x'x = ||x||3 = (x © K"). 
ire 


C.4 Given a matrix A € K”*" written in components, columns and rows as 


My +++) Xin | | —a,— 
A= : : — a: coe Qe = : 7 (C.1) 
m1 +** mn | | — a —_ 
we define the Frobenius norm (also known as the Schur of Hilbert—Schmidt norm) 
m n 2 1/2 m 2 1/2 n ky? 1/2 
Alle = | Do do Leja =(diliglz) =( Lillia) - 
j=lk=1 j=l k=1 


Thanks to |jq;||3 = aaj and |\a*\|5 = (a*)’a‘, the last two expressions can also be 


written in terms of the trace of the Gramian matrices AA’ and A’ A: 


|| All? = tr(AA’) = tr(A’A). 
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Since the trace of a matrix is defined as both the sum of the diagonal elements as 
well as the sum of the eigenvalues, it holds?” 


m n 
Allg = DA(AA) = YAR (A‘A). 


C.5 The Frobenius norm has the following properties: 
e |lJ||e = \/tr(J) = Vm for the identity matrix I €¢ K”*"; 
e submultiplicativity (follows from (2.3b) and the Cauchy—Schwarz inequality) 


|A-Blle <||Alle-||Blle (Ae K"™™", Bek"), 


e invariance under column orthonormal Q as well as adjunction 


Q 
A 


/ / 1/2 / 
|Q4lle = (tr(4’Q'QA)) = [IAlle, —A'Ile = Alle. 
=I 


C.6 A vector norm induces a matrix norm for A € K™%*” in accordance with 


|| Ax| 
IIx 


b 
|| Al| = max || Ax]| @ max || Ax]| ) max 
\[x||<1 \|x||=1 x40 


Remark. Here, (a) and (b) follow from the homogeneity of the vector norm; the maximum 
is attained since {x € KK": ||x|| <1} is compact and x ++ ||Ax|| is continuous. 


For the identity matrix I € K”*™, ||I|| = 1 is always valid; therefore, the 
Frobenius norm is not an induced matrix norm for m,n > 1. 


C.7 From the definition, it directly follows for induced matrix norms that 
[Axl] < [|All [lx (@ € K") 
and therefore the submultiplicativity 
|ABl| < Al] 1B) (Ae kK", Bek" ?). 
Remark. A vector norm on K” always matches its induced matrix norm on K"*1. we can 


therefore continue to identify vectors as single column matrices and vice versa. The first 
inequality is then just the case p = 1 of general submultiplicativity. 


921 (M),...,Ap(M) denote the eigenvalues of the matrix M € KK?*? in order of algebraic multiplicity. 
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C.8 Numerical analysis uses the following induced matrix norms for A € K”*": 
e max-column-sum norm 


(a) 
||All1 = max || Ax|], = max ||a*|[1; 
\|x|]1<1 k=1:n 


@ max-row-sum norm 


b 
Alles = max, ||Axoo max lll 
Ilx\]o0<1 jain 
e spectral norm 
©) M2 a) = 
|All2 = max ||Ax||2 = (max aj(44')) = (max ax(4’A) : 


\|x|]2<1 k=1:n 


The invariance of the Euclidean norm under column orthonormal Q directly 
induces” the invariance 


|QAll2 = ||Allz, therefore notably ||Qllz = ||Q!ll2 = |llz=1% (C2) 


both render column orthonormal matrices so important numerically. 


Concerning adjunction, it directly follows from (a)-(d) that 


|A%ll1 = [lAlloo, A’ Ilo = IA, (A’ll2 = [Alle (C.3) 
Remark. The calculation of the matrix norms || - |1, || - [oo and || - ||p is fairly simple, but 
that of the spectral norm || - ||, is computationally far more costly by comparison: an 


eigenvalue problem would have to be solved. Since the same set of invariances is satisfied, 


to 


the Frobenius norm often serves as a “poor man’s” replacement for the spectral norm. 
Exercise. Show that ||xy’||2 = ||x|l2||y||z2 (as used in §§11.9, 15.2 and 21.7.) 


C.9 Real analysis has taught us that all norms are equivalent for a finite dimensional 
vector space V: for || - || and || - ||q, there exists a constant cpyq > 0 such that 


olla  (vEV). 


Illp < epg 


Such a Cp, does however depend on V; therefore on the dimensions for vector 
and matrix norms. For the matrix norms from §§C.4 and C.8 (and accordingly 
with n = 1 for the vector norms from §C.2) we display the best possible values of 
the constant cy,, in the following table, where r = min(m, 1): 


NM] 12 © &§ 
1 1 Vm m Jm 
Daft °C fm 4 
co}/n fn 1 Yn 
F\J/n vr Vm 1 


8 Alternatively, one once again uses (QA)/(QA) = A'Q'/QA = A’A. 
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Exercise. Show that the table is valid. Hint: Consider n = 1 first and make use of A’. 


C.10 The matrix |A| € K”*" represents the componentwise application of the 
absolute value function on A € K"*"; we read inequalities written as 


|A| < |B (A,B € k"™*") 
componentwise (as introduced in §§7.9/7.11). Matrix norms that satisfy 
ALI] = [Al] and |A] < |B] = |All < [1B 


are called absolute and monotone respectively. 


Example. The norms || - ||, || - |lco, || - || are both absolute and monotone. The same 
can be said of the spectral norm || - ||2 in the case of vectors only (m = 1 or n = 1). 


Remark. For finite dimensional norms, it generally holds: absolute + monotone. 


C.I1 It holds for diagonal matrices that 


|| diag(x)||p 2 IIxllo (p € {1.2,00}),  IIdiag(x)|le = [Iz (CA) 


Remark. For induced matrix norms, (a) is equivalent to the monotonicity (absoluteness) of 
the inducing vector norm. 


Exercise. Prove the equivalences stated in the remarks in §§C.10 and C.11. 


Exercise. Show that for normal matrices A (cf. 18.6) there holds ||A||2 = p(A). Here, 
p(A) = max |o(A)| 


denotes the spectral radius of A. 


Exercise. Let P € K”™*" be a projection” of rank r. Show that 
© ||P > 1; 
e ||P\|2 = 1 if and only if P is an orthogonal projection; 
e if0 <r <m, it holds ||P\|2 = ||I — Pll2. 


Hint: Construct a unitary U, such that 


IS 


1 = 
u'Pu = (5 0 


and deduce that ||P||2 = ,/1+ ||S||. 


°4 A general projection P is defined by P? = P, and is additionally orthogonal when P’ = P. 


) Se Kr (m-r) 
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D The Householder Method for QR Decomposition 


D.1| We want to calculate the full QR decomposition of a matrix A € R”*" with 
full column rank in a column-wise manner. With a unitary Q1, the first step is 


| | P1 * 
| decstia | x 


and, thereafter, the matrix A; is used for the subsequent calculations. 


Exercise. Thoroughly describe the algorithmic procedure used this way for the calculation 
of Q’A = R. Show how Q can be obtained as the product of the unitary matrices from the 
individual steps. 


D.2 We are thus left with the problem of how to directly calculate a full QR 
decomposition of a column vector 0 4 a € R”: 


Ola= (+) = pe}. 


D.3 To this end, we construct Q’ as a reflection at a hyperplane, which we describe 
by a vector v # 0 perpendicular to the hyperplane. Such a reflection is then applied 
to the vector x € IR” by flipping the sign of its components in the direction v: 


Hence, it is nothing more than that the vector 22%u being subtracted from x, 


/ 


Ovex oe" 


2. 

/ / 

v Q =I1- —v0. 
vo’ vv 


Reflections of this form are called Householder reflections in numerical analysis. 
Exercise. Show: Q* = I, Q' = Q, detQ = —1, Qis unitary. 


D.4 We must now find a vector v such that Q’a = per, Le., 


/ 

v 
a—2—v = pe}. 

vv p 


Since a reflection (as with every unitary matrix) is isometric, we get in particular 


lel = llall2, v € span{a — pe}. 
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Since the length of v cancels itself out in Q’, we can finally set 


t||a||2, v=a-—pey. 


p= 


We then choose the sign such that v is calculated, in any case, in a cancellation free 
manner (writing a = (;)j=1:m and 0 = (wy) j=1:m): 


p = —sign(a1)||all2, w1 =a, —p =sign(ay) (lar|+ |p|), oj =a; (7 > 1). 
Hence, it holds 
o'v = (a — pes)! (a — per) = ||a||3 — 2pay + p> = 26° — 2pay = 2\p| (|p| + ||) 
and therefore in short (we know that Q’ = Q) 
Q=1-ww!,  w=o/y/lpl(|ol+ ler), [lela = v2. 


Exercise. Generalize this construction to the complex case a € C”. 


D.5 The calculation of the decomposition Q’a = pe, therefore has computational 
cost #flop = 3m, as long as only w and ¢ are calculated and Q is not explicitly 
stored as a matrix. The application of Q to a vector x, ie., 


Qx =x-(w'x)w, 


accordingly has computational cost # flop = 4m. 


Exercise. Show that the calculation of a full QR decomposition of a matrix A € R™*" 
(m > n) using Householder reflections has computational cost (again as long as Q is not 
explicitly constructed, but rather only the vectors w of the Householder factors are stored) 


n—1 
#flop = 4 Yi (m k)\(n—k)= 2mn? — 2n>/3. 
k=0 


n—-1 
The evaluation of Qx or Q'y each have cost # flop = 4 Ye (m — k) = 4mn — 2n?. 
k=0 
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E For the Curious, the Connoisseur, and the Capable 
Model Backwards Analysis of Iterative Refinement 
E.1 Skeel’s Theorem 15.13 can be nicely explained using a simplified model:”° 
Only the factorization step P'A = LU is calculated with error, so that®® 
P(A+E)=L-U, — ||E|| = O(7(A)emach)IlAll- 
Hence, subsequent numerical solutions ti of Au = 0 fulfill (A+ E)ii =v. 


In this model, one step of iterative refinement applied to Ax = b yields a numerical 
solution x, satisfying 


x9 = 2%, 1t9 =b— Axo = Exo, (A+E)bd=109, xy =x +0; 
the residual thereof is r; = b — Ax, = 19 — A = Ew. We calculate 
Aw = E(xo — ®) = E(x, —20), EW = EA~1Ex, —2EA™ ‘Ed; 
using norms results in the residual bound 
lIrall = EO] < JE|PA~* I [lanl] + 2 EI A || Eo] 
= O(7(A)?«(A) |All Il llemacn) + O(7(A)«(A) emach) I71I- 
Now, if (A)*«(A)€mach = O(1) with 7(A) > 1, we get 7(A)«(A)emach < 1. 


Thus, x; has an error in the model that already ensures backward stability: 


= O(7(A)?K(A) each) = O(€mach)- 


mach 


Remark. For well conditioned matrices with x(A) = O(1),a single step of iterative refinement 
can thus compensate a growth factor of as large as 


(A) = O(€ R38), 


mach 


which corresponds to an initial loss of accuracy of up to half the mantissa length (cf. the 
discussion in §15.12 and the numerical example in §§15.10/15.13). 


Exercise. Construct an example to show that for (A) > ez) ch even multiple steps of 
iterative refinement do not necessarily lead to an improvement in backward error. 


°*5For a componentwise study of this model, see F. Bornemann: A model for understanding numerical 
stability, IMA J. Numer. Anal. 27, 219-231, 2007. 
%6CF£. §§15.8 and 15.9. 
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Global Convergence of the QR Algorithm without Shifts 


E.2 Even though the convergence of the QR algorithm without shifts and deflation 
is much too slow to be of any practical importance, J. Wilkinson’s classic and 
elegant convergence proof? 7 is still very interesting for mathematical theory: 


1. It explains how eigenvalues arrange themselves on the diagonal of the 
triangular factor of the Schur decomposition obtained in the limit. 


2. It relates the QR algorithm to the power iteration: the unitary factor U; of 
the QR decomposition of A‘ becomes asymptotically the similarity transfor- 
mation of A to triangular form. 


3. It shows that in course of the QR algorithm with (convergent) shift, not only 
does the last row of Ax converge but the others are already “prepared”. 


E.3 The proof is based on a triangular decomposition of largely theoretical interest, 
which goes by applying a particular column pivoting strategy”® to A € GL(m;K): 
we will use the first zero element as pivot and will not merely transpose it into 
position, but rather cyclically permute all intermediate rows down one row. In the 
notation” from §7.9 it can easily be shown by induction (left as an exercise) that 
along with L, its transformation 

P,LyPy. 


is also a unipotent lower triangular. We therefore obtain a decomposition of the 
form P!.A = LR, where L = P,LP", is also a unipotent lower triangular; it holds 


A= LPR. 


Such a decomposition is called a (modified) Bruhat decomposition. 


Exercise. Show that the permutation 7t of a Bruhat decomposition A = LP; R is uniquely 
determined by the matrix A € GL(m;K), but not the triangular factors L and R. These 
become unique by requiring that L = PLP, is also lower triangular. 


E.4 We assume that the eigenvalues of A € GL(m;C) differ in absolute value 
(and are thus real if A is real), so that they can be ordered as follows: 


|A4| > |A2| >> |Am| 0. (E.1a) 


97}, Wilkinson: The Algebraic Eigenvalue Problem, Clarendon Press, Oxford, 1965, pp. 516-520; the 
presentation here follows roughly the argument from E. Tyrtyshnikov: Matrix Bruhat decompositions 
with a remark on the QR (GR) algorithm, Linear Algebra Appl. 250, 61-68, 1997. 

*8Wilkinson loc. cit. p. 519. 

In which here, of course, T% represents the cycle (123 --- r,) instead of the transposition (11), 
when the pivot a; is found at row number r; of the matrix Ax. 
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In particular, there exists a corresponding basis of eigenvectors (which are real for 
real A); if we use this basis as columns of the matrix X € C”™%*", it holds 


X1AX=D, D=diag(A1,...,Am). (E.1b) 

Wilkinson’s elegant idea consists of applying the Bruhat decomposition! to X71, 

X-! =LP;R, (E.1c) 

where L is unipotent lower triangular and R upper triangular. With triangular 

factors at hand, the specific order of the eigenvalues can be conveniently exploited 
in the following form: we generally have, for any lower triangular L,., that 

D'L,D* = diag(L.). (kK =}-00). (*) 


This is because with @ = max;s4 |Aj/Ax| < 1 the elements [5 of L. satisfy 


O(6) +0 ifp>q, 


Ap * * c 
(5) bg = ) 'pp a 


0 otherwise. 


E.5 Starting with Ap = A, the QR algorithm constructs the sequence 
Ap =QkRe, — Agyr = ReQe = (kK =0,1,2,...). 
It holds Ay = U; AU, and Ak has a QR decomposition of the form 
AN = UySk, Up = Qo? Qk-t, Se = Re Ro. 
For k = 0, both of the claims are clear, and the induction steps from k to k + 1 are 
Agi = QArQk = QYUpAUQ« = Ue AUK +4 
as well as 


AM} = AAK = AUS, = Ug(UpAUg) Sp = UpAgS = Ur(QeRe) Sk = Uns 1 Sk41- 


10Tn the “generic” case, there are no zero divisions to be avoided, which means that generally the 
eigenvalues will get sorted: 7c = id. In the exceptional cases, however, the realm of rounding errors 
will lead in course of the iteration, by backward analysis, to increasingly perturbed A and, a fortiori, 
to increasingly perturbed X. The corresponding 7t therefore deviates less and less (in discrete 
jumps, now and then) from id and a numerical “self sorting” takes place in the end. 
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E.6 Using P = P, for short, we obtain from A = XDX~! and X—! = LPR that 
(p= A's," SXD"(EPRS,* 
and therefore from A; = U,AU, (notice that Ui, = U,') 


A= SRP LD Xk ARID LPRS, = SER P(E “DLPRS, 
_ 


=D 


In this way, we get Ay = W, |B. Wx with B, = P/D‘(L-'DL)D~*P and the upper 
triangular matrices 


W, = (P’D‘P)RS~! = P'(D‘L-!D-*) Dk x-1s571 = p’(DkL-'D~*)x-U,.. 
k k k 


The last equality shows that the sequences W, and W,- 1 are bounded: by (*), there 
holds DKL~!D-* — diag(L~!) = I; the U; are unitary. Besides, it follows from (*) 


B, + P’ diag(L~'DL)P = P'DP=D,, Dz = diag(Ax7(1), a, np Aneten)* 
In summary, we obtain the asymptotic result 
Ac = W,'DrWe + Wy '(Be—- Dr)We 
— _—— 
upper triangular 0 


and therefore Wilkinson's convergence theorem: the assumptions (E.1) imply 


diag(A,) > Dz, strict lower triangle of A; — 0. 


Remark. By compactness, a selection of subsequences yields U;, — Q within the space of 
unitary matrices and W;, — W within the space of non-singular upper triangular matrices. 
The limit of Ay, = Uj AUk, = We 1D7W, + (1) then gives the Schur decomposition 


Ax(1) Kore * 


QAQ=T, T=W '!D,W= 


m(m) 
Exercise. 
e By means of an example, show that the sequence Ax, itself does generally not converge. 


e Show that a selection of subsequences can be avoided if the sequence A; is replaced 
by a sequence of the form LD Abe with suitable unitary diagonal matrices Xx. 


e By constructing specific matrices A, show that every 7t € Sj can be attained (in theory 
only; see Footnote 100 for what to expect in actual numerical experiments). 
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Local Convergence of the QR Algorithm with Shifts 


E.7 Our goal here is to understand, and to partly prove (under somewhat stronger 
conditions), the claims about convergence from §§21.10-21.12. To this end, we 
partition the matrices of the QR algorithm with shift as follows: 


hate By — pil | Wr _ Py | Ug Sk | Sk 
re | Ae Bk M | Mk 0 | px 
=Q& =Ry 
eae Bupa — Hel | We44 _ Sk | Sk Pr | Ug . 
They | Aka — Hk 0 | Pk O% | Mk 
=Ry =Q 


where, without loss of generality, the QR decomposition is chosen such that px > 0. 


E.8 We aim at estimating ||7,4,||2 in terms of ||r,||2. Via multiplication, we obtain 
Th =USkr Tey = Pky 
and thus, with writing 0; = ||S;'||> for short, the initial estimate 


\|De|l2 < ellrell2, 7x2 ll2 < ex%llrell2- 


We must therefore estimate o; and px. 


E.9 Since Q; is unitary, it holds by the normalization of the last column and row 
1 = |lugll3 + lel? = [orl + lel, 
therefore in particular 
Mello = lloxll2, — Iel <1. 
It furthermore also follows from the unitarity that 


P. | Ok By — pl | Wk Sk | Sk 


uf, | Hf vk | Ak He 7 0 | 
and therefore 
Pk = Up We + MAR Hk), Pk < ||Uellallevell2 + [Ak — wel: 
For the Rayleigh shift uj, = Ax, the estimates obtained so far imply 


Pk <o«llellallrelle, — UMresalle < oe llvellallrelld- 
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E.10 Asa sub-column of A, the vector w;, remains bounded by 
I|well2 < | Axll2 = ||All2- 


If A and therefore also A, are normal, it follows via the expansion of A, A, = A, Ax 
that the (2,2) element of the partitioning satisfies 


IIrell5 + Ag? = [lel + Ag ?, thus |[vell2 = [Irell2- 
To summarize, we arrive at the following statement about the Rayleigh shift: 


oe llrell if A is normal, 
II*ka1ll2 < (E.2) 
oF \|Allollrell3 otherwise. 


We are thus finished when we give reasons that oj actually remains bounded. 


E.11 Due to By — ppl = PS, and ||Pr|]2 < ||Qx|]2 = 1 there is 


OK = IISq "lla < (Be — MeL) “Ila = sep (Hi, Be) 


If the shift 4; converges as ||r;||2 + 0 towards a simple eigenvalue A of A, it must 
be the case that B, converges towards a deflated matrix A), where A is no longer 
contained in the spectrum, so that 


On < sep(Hg, By)! 4 sep(A, Ay) 1 < ©. 


Hence, if we assumed convergence as such, we would obtain from (E.2) the 
asserted quadratic or cubic speed of convergence.!7! 


E.12 Without already assuming convergence, the boundedness of 0; can be shown 
through perturbation theory. For the sake of simplicity, we will restrict ourselves 
to the case of a self-adjoint matrix A with pairwise different eigenvalues. Let the 
minimum mutual distance of two eigenvalues be 3d > 0. Notice that along with A 
also the matrices A; and B, are self-adjoint and that furthermore 7(A) = o(A,). 
Hence, upon writing 


By | rk By | rk 


Ap = ; 
" 


as / 
Ak | Uk re | Ak — Hk 
———~-—< 
=F =Ex 
with the perturbation bound (using the Rayleigh-Shift) 


Ell3 < Exile = 2llrelld + [Ae — mal? = 2llrell3, 


101Up to here our argument has essentially followed the discussion in G. W. Stewart: Afternotes goes to 
Graduate School, SLAM, Philadelphia, 1997, pp. 139-141. 
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Corollary 19.3 implies under the assumption ||r,||2 < 6/2 that for every A € o(A) 
there is exactly one ps € o(F,) with |A — p| < 6. Therefore, the eigenvalues of Fy 
are pairwise different, too, and the distance between any two of them is at least 5 
(why?), so that in particular 


dist(Hx, (Fx) \ {ux }) = 6. 
(Bx) 
=0( By 


By Theorem 19.3, we finally obtain the bound 


Ok < sep(}, Be)! = dist(jx,0(By))* < 6". 


E.13 We can therefore state the local convergence result as follows: 


Theorem. Let A be self-adjoint with pairwise different eigenvalues that have a mini- 
mum mutual distance of 35 > 0. If ||ro|l2 < 6/2, then as k + 00 the QR algorithm 
with Rayleigh shift yields cubic convergence of the form 


lIre+alle < 9" [Irall2 — 0. 


Proof. It remains to be shown by induction that ||7,||2 < 6/2. For k = 0, this is 
exactly the assumption on 19; the induction step from k to k +1 is 


lIrealle < ofllrell3 < 977 lirall3 < Sllrall < [lrell2 < 6/ v2. 
Furthermore, the intermediate bound ||r,+4||2 < ||r,||2/2 inductively proves 


lIrell2<2*|lroll2 +0 (kK 00), 


which finally ensures convergence itself. 


E.14 The Wilkinson shift 1; satisfies by §21.12 (with the notation used there) 


(Ak — Uk) (&k — Be) = Be Yk, [Ak — Hel < lek — pele 


Here, the last inequality holds since the solutions of the quadratic equation are 
also symmetric relative to the midpoint (A; + a,)/2 of ag and A. Hence, we get 


[Ak — Hal? < [Bevel < |lwellallrell2 


and, therefore, the bound 


1/2 


3/2 
lIresalle < op lamella rill + 7% level’ relia’. 


Now, in the self-adjoint case, there holds ||E;||2 < V3||rgll2 and thus o, < 6~1 if 
\Irx2 < 5/3. Altogether, completely similar to the proof for the Rayleigh shift, 
we obtain the following theorem: 
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Theorem. Let A be self-adjoint with pairwise different eigenvalues that have a mini- 
mum mutual distance of 35 > 0. If ||ro|lz < 6/W3, then as k — 0 the QR algorithm 
with Wilkinson shift yields quadratic convergence of the form 


IIretalle < 2671 [rg ||5 — 0. 


Remark. In the realm of the Wilkinson shift, the speed of convergence can be improved to 
that of the Rayleigh shift if there ig!02 


T =Ap—ap3OT#O. 


Actually, from 


(He — Ag)? + Te(He — Ak) = Bev BRE! < Wvellalrelle, 


one then gets the substantially improved bound 


[He — Agel = O(t~"||ewellellrell2), 


so that given a bounded o; there holds, just as in §F.9 for the Rayleigh-Shift, that 
Pe =O(|lwellallrelle), — MMre+alla = O(lerellallrell3).- 

A Stochastic Upper Bound of the Spectral Norm 

Disclaimer: this section is only suitable for more advanced readers. 


E.15 Given A € C”*" and 0 4 x € C” it holds by definition that 


In §20.11 we used the fact that for random vectors x a converse bound holds with 
high probability, too: 


Theorem. Let A € C”*” be fixed and x € IR™ be a random vector with independent 
standard normal distributed components. It then holds with probability > 1 — 6 that 


|All2 < o1 Vm 


Ate esc). 
2 


IIx 


Proof. From the singular value decomposition of A (with U, V unitary), that is, 


A= Udiag init )V |All =o, >--- > Om > 0, 


12This holds, e.g., in the case that A; — A € o(A) while the eigenvalues of A — AI differ in absolute 
value; cf. Theorem E.2. 
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as well as the unitary invariance of the spectral norm and of the multivariate 
normal distribution it follows with independent standard normal distributed 
random variables ¢1,...,¢m 


A gta... 4+g2 
p (aE <cl4lz) =P (PGS ee a69 
Isl Gt 4+e 


Now, for m > 2, the random variable R* = é7/(@] + @ +--+ +2.) is of the form 
X/(X+Y) 


with independent chi-squared distributed random variables X ~ xe and Y ~ oan 
it is therefore beta distributed with parameters (1/2, (m—1)/2) and, hence, has 
the distribution function F(t) = P(R* < t) with density! 


1 
~ B(1/2,(m—1)/2) 


F'(t) Pyne ee O< Fei), 


Thus, the probability distribution 
G(6) = P(R<5/\/m) =F(S*/m) (0<d6< Vm) 
has, for m > 2, the density 


26 


G'(5) = —F'(8°/m) = 


am-1/2 gar (-9)/2 
B(1/2,(m—1)/2) ( ) : 


From Stirling’s formula follows the monotone limit 


2m-1/2 — yy tl? T(m/2) , 2 
B(1/2, (m—1)/2) T(1/2)0 ((m —1)/2) 7 


104 


and hence, for m > 3, the upper bounds 


a < 2, a) < f25<6 (0<d5< Vm). 


103 All of this can be found in §9.2 of H.-O. Georgii: Stochastics: Introduction to Probability and Statistics, 
3rd ed., Walter de Gruyter, Berlin, 2008. 
10The first two bounds are asymptotically sharp for 6 > 0 small, since as m —> co 


2\ (m—3)/2 : 
(1 = -) +e, Gl) fae Pe (0<5 <0). 


m 
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For m = 2 it holds after integration 
G(6) = = aresin(6/V2) <5 (0<d< v2) 


and for m = 1 the trivial approximation G(5) = 0 < 6, where 0 < 6 < 1. Therefore, 


IlxIl2 vm vm 


which finally proves the assertion. 


Remark. As the proof shows, for m > 3 the bound of the theorem can be improved by yet a 


factor of 
\/ 2 0.79788, 
A 


but this is then asymptotically sharp in the case 01 = «+: = 0m =0 asm —+ co and 6 > 0. 
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F More Exercises 


The student is advised to do practical exercises. 
Nothing but the repeated application of the 
methods will give him the whole grasp of the 
subject. For it is not sufficient to understand the 
underlying ideas, it is also necessary to acquire a 
certain facility in applying them. You might as well 
try to learn piano playing only by attending 
concerts as to learn the [numerical] methods only 
through lectures. 


(Carl Runge 1909) 


There are 62 exercises already scattered throughout the book; here, 36 more will be added. Better yet, 
there are numerous other ones to be found in the “classics” listed on page viii which I would highly 
recommend to all the readers for supplementary training. 


Computer Matrices 


Exercise. Alternatively to §5.5, forward substitution can also be achieved with the following 
recursive partitioning of the lower triangular matrix L, = L: 


Ak 
Ly = 
Ik | Lega 


Based on this, write a MATLAB program for the in situ execution of forward substitution. 


Exercise. Let A € K”*"™ be given with coefficients a jx. The goal is to calculate 


Gj = Loa = (f=1:m). 


e Program the componentwise calculation using two for-loops. 

e Program the calculation without for-loops but rather with matrix-vector operations. 
e Compare the execution time for a real random matrix with m = 10000. 

e What is the name of the corresponding custom-made BLAS routine? 


Hint. Read help tril; help triu; help ones; and www.netlib.org/lapack/explore-html/. 


Exercise. Let u,v € K™ and A=I+uv! € K™*™, 
e Under which condition (on u and 7v) is A invertible? 
e Find a short formula for A~! and det A. 


Hint. First, consider the specific case v = e; using a clever partitioning. Transform the general case to 
this specific case. 
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Matrix Factorization 
Exercise. Given A € GL(m;K) and D € K"*", consider the block-partitioned matrix 
A B 
m=(¢ 5): 


e Find the block-LU decomposition of the form (I is the identity) 


m=(ee )( vJE 1): 


e Give a simple factorization of M7! and characterize its existence. 
e Write M~! in the same way as M in form of a 2 x 2-block matrix. 


e Compare the results with the textbook case m =n = 1. 


Exercise. Consider the linear system of equations Ax = b for the matrix 


e Permute the rows and columns of A in such a way that the storage cost for the factors 
of an LU decomposition grows just linearly in m. Give an explicit form of L and U. 


e Solve the system of equations with the help of the thus obtained LU decomposition. 
What is the condition for solvability? 

e Write a MATLAB program without for-loops which finds the solution x given u, v 
and b. How many flop are required (in leading order)? 


Exercise. Identify what the program 
1 [m,n] = size(B); 

2 Borenstein 

Ore j/=1me male al 
1 EC oa) = Cy) pal VIC a 5 a0) 8 
5 fore k=1e ia! 

B(k,i) = B(k,i) - B(j,i)*A(k,j); 

7 end 
8 end 
9 end 


does given the input A € K”*™, B € K™*". Replace it with a short command without the 
use of for-loops. 


Hint. First, replace two of the for-loops with matrix-vector operations and then study the inverse 
operation that reconstructs the input matrix (stored in B) given the output matrix B. 


Exercise. Let A € K”*" be invertible, b € K” and n € N. Describe an efficient algorithm 
for the solution of the linear system of equations 


A™x =b. 


How many flop are required (in leading order) for n < m? 
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Exercise. Let a tridiagonal matrix 


Pm 
Am-1 om 


be given, which is column-wise strictly diagonally dominant: 
li] > |Aal, [8 > Ag] + lojl G=2:m—-1), — |dm| > lem. 


e Show that A has an LU decomposition with a unipotent lower bidiagonal factor L 
and a upper bidiagonal triangular factor U, such that U agrees with A above the 
diagonal. Why is there no need for pivoting? 


e Formulate this decomposition as a MATLAB program, which receives the vectors 
d = (6;)j=19m €K™, 1 = (Aj)jotom—1 EK" 1, 1 = (p;)j=22m € K™ 


as input. Pay attention to an efficient use of memory. How many flop are required 
(in leading order) by your program? 


The method developed in this exercise is referred to in the literature as Thomas algorithm. 


Exercise. The sweep operator 7; acts on a matrix A € K"™™" as follows: 


Ax=y => TAK=FH 


where x = (€1,.--,Cx-1 Gk Sepa 08m) Y= (Meee Ma Mee Mpa Mm)’, 
B= (C1r--- ka Mee Segre Sm) P= (Meee Meas Ske Mega)’ 
Find a clever formula for 7,A. Under which condition on A is 7; well-defined? Show (while 
explicitly stating a simple condition that guarantees the operations to be well-defined): 
e 7; and 7, commutate for 1 < j,k < m. 
e Sweep operators are of order 4, so that in particular J, ! = 7;°. 
e For a self-adjoint A, the matrix 7;.A is also self-adjoint. 


e If we partition A such that A is the k x k principal submatrix, it holds 


Au | A -Az | Ay Ai 
one _ ~ 7 
Ax | Az An Aj; | Az — An Aj, A12 
The block Azz — An Ay Ai is called the Schur complement of Ay, in A. 


e Itholds Jj ---TmA=—A7!. 


Though sweep operators are a popular “workhorse” in computational statistics, 
potentially numerically unstable when used without pivoting (example?). 


105 they are 


05See K. Lange: Numerical Analysis for Statisticians, 2nd ed., Springer-Verlag, New York, 2010. 
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Exercise. Let x1,...,Xn € IR” be a basis of the space U. Write a MATLAB two-liner that 
calculates an orthonormal basis of the orthogonal complement of U. 


Exercise. Given A € K"*" where n < m, let a compactly stored full QR decomposition 
with F = qrfact (A) be pre-calculated in Julia. Only afterwards let a vector a € K™ be given. 


e Write, as a Julia function QRUpdateR(F , a), an efficient algorithm to calculate only the 
R-factor of a reduced QR decomposition of the matrix 


A, = (Ala) € K"™ 04), 


Give reasons why the function performs as desired. 


Hint. The scaling or the addition/subtraction of vectors should both be categorically avoided. 


e Estimate how many flop QRUpdateR(F,a) saves when compared to a re-calculation 
from scratch. Perform some benchmarking measurements. 


Hint. Recall that the matrix-vector products F[:Q]*x and F[:Q]’*y only require O(mm) flop. 


Error Analysis 


Exercise. Consider for 0 < € < 1 the linear system of equations Ax = b where 


1 1 0 
HG). 0) 
e Show that xoo(A) = 2+2e7! >> 1. Find a perturbation A + E with ||E|lo < €, such 
that the relative error in x with respect to the || - ||.o-norm exceeds 100%. 


e Show that the linear system of equations is well conditioned for relative error mea- 
sures when only the input e€ is perturbed. 


Exercise. Calculate (with respect to componentwise relative errors) the condition number 
x(det, A) of the map A € IR2*2 +5 det A. Characterize the ill-conditioned case. 


Exercise. Let e(€) be the relative distance from 0 < ¢ € Fg, to the next nearest machine 
number. Plot e(@) in double logarithmic scale. Locate the machine precision €mach- 


Exercise. MATLAB calculates: 


1 >> format long e 
2 >> exp(pi*sqrt (67)/6) - sqrt (5280) 
3 ans = 

6.121204876308184e-08 


How many decimal places are correct in this results (by “clever inspection” only, i.e., by 
mere counting and without the use of a computer or calculator)? What would one have to 
do in order to calculate all 16 decimal places correctly? 
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Exercise. Here we will practice to turn simple expressions into numerically stable ones: 


e For which values of ¢ are the following expressions unstable: 
VE+1-Vé Fe [0,cf;  (1—cosg)/e*, Fe (-m,7)? 
e Find stable expressions. Compare for ¢ = 2, j = 48:53, and ¢ = 107/, jH4a:9. 


Exercise. Here we will further practice to turn simple expressions numerically stable: 


¢ Calculate the expression (1+1/n)" for n = 10", k = 7:17. What had you expected 
(from calculus)? Explain the observed , numerical limit “. 


e Find a stable expression for (1+1/n)". 


Exercise. For x > 0, the expression log (Vi+2 = 1) defines a function f(x). 
e For which values of x is f ill-conditioned with respect to the relative error? 
e For which values of x is the given expression a numerically unstable algorithm? 
e Find a stable expression. 


Hint. You may use the fact that 1 < xf'(x) <2 for x > 0. 


Exercise. Let one be interested to calculate a root of the cubic polynomial 
43qx—-2r=0 (q,r>0). 


According to Cardano (1545) a real root is given by 


x= r+ P+? i/ rt yet. 


This formula is numerically unstable and, based on two cubic roots, also quite expensive. 


e Explain why Cardano’s formula is, for r << q and for r > q, potentially unstable. 


e Find a numerically stable formula for x9 that allows the calculation with just one 
cubic and one square root. 


e Show that the root x9 is well-conditioned; it actually holds that «(x9;q,1r) < 2. 


e For the cases of r < q andr > q, give numerical examples in which Cardano’s 
formula looses at least half of the mantissa length in accuracy. 


Exercise. MATLAB calculates: 


| >> rng (333) ; % for reproducibility 
2 >> A = randn(4000); 
3 >> det (A) 
1 ans = 
Sint 


Compute the determinant of the matrix A without under- or overflow. How many decimal 
places of the solution are then correct? 
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Exercise. Consider the complex square root Vz forz = x4 iy (x,yeE R,y £0). 


e Write an efficient, numerically stable program that only uses arithmetic operations 
and real square roots. Be sure to avoid over- and underflow; the program should 
work, e.g., for the following input: 


x = 1.23456789 - 107°, sy = 9.87654321 - 10-2, 


e Give simple estimates of the condition numbers of the following maps: 


(x,y) Re(vz), (x,y) > Im(y2). 


Exercise. Let the system of linear equations 
Hx=b, b=(1,...,1) eR", 


be given with the Hilbert matrix H = (hjx) € R1°*1, hip = 1/(j +k —1) for j,k = 1:10. 
Hint. H can be generated with the MATLAB command hilb. 


e Calculate a numerical solution ~ using LU decomposition with partial pivoting. 
e Calculate the normalized relative backward error w. Is < backward stable? 


e Estimate the condition number x..(H) with the help of the MATLAB command 
condest and compare the bound Ko(H) - w of the relative forward error of % with 


its actual value 7 
||x — £|loo 


\|£ [loo 


The exact solution x € Z!9 can be determined with the MATLAB command invhilb. 


Exercise. The LAPACK manual warns of the following “professional blunder”: 


Do not attempt to solve a system of equations Ax = b by first computing A~! and 
then forming the matrix-vector product x = A~‘b. 


This exercise develops an explanation. 


e Search online to find how many flop this “blunder” would require, when Al 
is computed with the LAPACK program xGETRI (which is behind the MATLAB 
command inv)? Compare the result with the standard method x=A\b. 


e With the help of Criterion B from §13.8, characterize for which A the backward stability 
of the “blunder” is at risk. 


e Construct a reproducible example (in MATLAB) where the “blunder” produces a very 
large backward error but the standard method results in one within the magnitude 
of machine precision. Estimate and assess the forward error, too. 


Hint. Use normwise relative errors with respect to the || - ||. norm. An “exact” x would be taboo. 
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Exercise. A matrix A € K”*" is called row-wise diagonally dominant, if 


m 
aj) > Yo lal (j= 1:m). 
k=1 
kAj 
It is called column-wise diagonally dominant, if A! is row-wise diagonally dominant. 


e Show that if A ¢ GL(m;K) is column-wise diagonally dominant, triangular decom- 
position with partial pivoting does not actually perform any row swaps. 


e Assess the suggestion to turn off partial pivoting for row-wise diagonally dominant 
matrices A € GL(m;K). 


Hint. Compare the growth factors of A and A’. 


Least Squares 


Exercise. Consider the following three models (with random noise e; and parameters 6) 
for measurements b = (f1,...,Bm)' € IR™ and t = (t,...,Tm)' € R™: 


(i) Bj = AT; + 09+ Ej, 


(ii) Bj = 0177 + O27) + 63 + ;, 
(ii) Bj = 01 sin(t) + 02 cos(t;) + €;. 
Find a least squares estimator x of p = (01,02)/ and p = (61, 62,63)’ respectively. 


e Write the estimator in terms of a least squares problem || Ax — b||, = min! . 


e Write a MATLAB program which calculates, in a numerically stable fashion, the 
estimator x for each of the three models when passed the inputs b and t. 


Exercise. The coefficients (a, 6)’ € IR? of a regression line y = ax + f fitting the measure- 
ments (X1,¥1),--+,(Xm,Ym) are often given by the textbook formula 


j p=¥-ax, 


where * = i Bei Xj, XY = a pa xjyj, etc., denote the respective mean values. 


e Show that the formula can be directly obtained when the normal equation of the 
least squares problem 
lly — ax — lz = min! 


is solved with Cramer’s rule. 
e Give reasons why the numerical stability of this formula is at risk. 
e Construct a numerical example for which the formula is actually unstable. 


e How are the coefficients « and 6 computed in a numerically stable manner instead? 
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Exercise. Let A € R™*" with m > n have full column rank. 


e Show that ||b — Ax||2 = min! is equivalent to 


e Come up with a way to calculate the solution of this linear system (the sparsity of I 
and 0 can be ignored for the sake of simplicity). What is the computational cost? 


e For a o-dependent least squares problem, let x2(A) = 10” and xys = 10" be indepen- 
dent of o, but let the condition number «2(B) exhibit the following dependence: 


10-4 10-2 10° 102 104 106 108 7 


For which values of 7 would it make sense, for reasons of stability, to solve the least 
squares problem ||b — Ax||2 = min! via the system of equations with the matrix B? 


Eigenvalue Problems 


Exercise. The following figure shows the contour lines of the functions F,: A + sep(A, Ax) 
for the values 0.1 (blue) 1.0 (red) and 2.0 (green); stars mark eigenvalues: 


Im \ 


Im 


No} 


PN WR TDN © O 
Pn w eR TDN © 


& 


123 45 67 8 9 123 45 67 8 9 


e Based on the images, specify which of the matrices cannot be normal. 


e Estimate the absolute condition number of the eigenvalue A,, = 4+ 5i for Aj and Ap. 
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Exercise. Given A € C*™, show the Gershgorin circle theorem: 
a(A) Cc U Kj, where Kj = {zec: |z — a; < ¥ lai }. 
f=lenat kAj 


Hint. Examine the row of Ax = Ax, for which the absolute value of the corresponding component of x 
is largest. 


Exercise. Let A € C’”*™ and (A,x) be an eigenpair of A. The Rayleigh quotient 


x! Ax 


x x 


p(x, A) = 


shows the following perturbation behavior (¥ = x +h, h + 0): 


p(&, A) = p(x, A) +O(||h||) and, for normal matrices, p(%,A) = p(x, A) + O(|Al|*). 


Exercise. Let the normal matrices A, B and C have the following spectra: 


Aims... .Z(A)... 

9 }-0 ° 9 9 
Bfeleed : 8 8 
6 : Orb 6 6 
5 a) 5 5 
4 4 4 
3 Ob 0 3 3 
2 bk O-0 awe 2 2 
Lfoe Rex 1 1 
> 

123456789 123456789 123456789 


For each matrix, inverse iteration with shift p = 5 + 5i has been executed. The following 
figure shows the error €, = |}, — A| between the eigenvalue approximation ji; in the jth 
step and the eigenvalue A, towards which the iteration converges: 


e Towards which eigenvalue A does the iteration converge for A, B, and C? 
e Match the convergence plots (green, red, blue) with the underlying matrix. 


e Based on the spectra, calculate the convergence rates in the form e, = O(p* ). 
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Exercise. Given the eigenvalues of a normal matrix A and a shift 1, let there be 
I_—Aq| <|p—Aal< +++ <[p—Aml, |e Aa|/| — Ag| = 0.125. 


Estimate how many iteration steps k the inverse iteration will need to calculate an approxi- 
mate eigenpair (/;, 0%) with a backward error of O(€mach) in IEEE single precision. Which 
eigenpair is approximated? 


Exercise. If one were to use the current eigenvalue approximation f/,_1 as a dynamic shift 
instead of a fixed shift 149 = p in course of the inverse iteration, extremely fast convergence 
would result for real symmetric matrices: in every step, the number of correct digits in the 
eigenvalue is tripled so that machine precision is already achieved after only 3-4 steps 
(cubic convergence, cf. §21.10). This method is called the Rayleigh iteration. 


e Record the progress of 4; in course of this iteration for the matrix 
2 1 £41 
A={1 3 1 
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when the initial shift value is given by fo = 5. 


e How many steps would inverse iteration need for an example of dimension m > 1, 
in order to be actually more expensive than just 4 steps of the Rayleigh iteration? 


Exercise. Let x € C. Write a MATLAB function that when passed a previously calculated 
Schur decomposition Q’ AQ = T subsequently calculates the following data: 


e an eigenvalue A € a(A) closest to pi; 
e a corresponding normalized eigenvector x of A; 


e the backward error of the approximate eigenpair (A, x). 


Exercise. Let A,B,C € C’*"™ be given. The Sylvester equation for X € C’”*™, that is, 
AX —XB=C (*) 
can be rendered by means of Schur decompositions U’ AU = R and V'BV = S as 
RY-YS=E (**) 


where E = U'CV and Y = U'XV. 
e Find an algorithm that computes the solution Y of (**). 
Hint. Write (**) as m equations for the columns of Y. 
e Show that (*) has a unique solution if and only if 7(A) N 0(B) = ©. 
e Implement an algorithm in MATLAB that solves (*). How many flop are needed? 
Hint. Use the MATLAB command schur. It consumes O(m°) flop. 
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Exercise. Given the real matrix Ag € IR”*", let the lower right 2 x 2 block in (21.3) have 
the complex conjugate eigenvalues A, A’ ¢ 7(Ag). 


e Show that after two steps of the QR iteration with the Francis double shift, that is, 
Ag —AIL=QoRo, Ai =RoQo+AL Ar—A'I=QiRi, Ay = RQ) +A’, 


the matrix Az is also real if normalized QR decompositions were used: Az € R™*™. 


e How can the passage Ag ++ Ap be realized with real operations only? 


Notation 


K field IR or C 
W, Bp Yr+ +2 W scalars 


a,0,C,...,Z vectors (column vectors) 


a',b’,c',...,z'  co-vectors (row vectors) 


A,B,C,...,Z matrices 


i,j,l,m,n, p indices / dimensions 

1:m colon notation for 1,...,™m 

[A] Iverson bracket: 1, if expression A is true, else 0 
A’ adjoint of the matrix A 

ak, a kth column, jth row of the matrix A 

ey kth identity vector 

diag (x) the diagonal matrix made from the vector x 
GL(m; Kk) general linear group of dimension m 

Pr permutation matrix 

\|E||, TE] norm and error measure of a matrix E 

K(f; x) condition number of f in x 

«(A) condition number of the matrix A 

cond(A, x) Skeel—Bauer condition number of the system of equations Ax = b 
F the set of floating point numbers 

f1(Z) representation of ¢ as a floating point number 
Emach machine precision 


= equality after rounding 


=< equality and bound in leading order 
(A) growth factor of the matrix A 
w(x) backward error of an approximate solution ¥ 
sep(A, A) separation between A and A 
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adjunction, 3 
algorithm 
backward stable, 50 
end section, 52 
stable, 50 
start section, 52 
unstable, 50 
assembler, 14 
assessment of approximate solutions, 
60 
ATLAS Project, 13 


back substitution, 16 
backward analysis, 50 
backward error 
eigenpair, 78 
eigenvalue, 78 
linear system of equations, 60 
Bauer, Friedrich (1924-2015), 46 
beta distribution, 133 
bisection algorithm, 95 
BLAS (basic linear algebra subpro- 
grams), 8 
level 1-3, 13 
block partitioning, 7 
BLUE (best linear unbiased estima- 
tor), 70 
Bruhat decomposition, 126 
Bruhat, Francois (1929-2007), 126 
Bunch-Kaufman-—Parlett decomposi- 
tion, 115 


cache, 12 
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cancellation, 43 
cancellation free, 124 
Cardano, Gerolamo (1501-1576), 139 
Cauchy-Schwarz inequality, 119 
characteristic polynomial, 75 
chi-squared distribution, 133 
Cholesky decomposition, 29 
backward stability, 63 
Cholesky, André-Louis (1875-1918), 
30 
colon notation, 2 
column rank, 31 
compiler, 14 
condition number, 41 
linear system of equations, 46 
formula, 43 
linear regression problem, 71 
matrix, 45 
matrix product, 44 
root of a polynomial, 76 
sample variance, 59 
Skeel—Bauer, 46 
convergence 
locally cubic, 91 
locally quadratic, 91 
convergence theorem 
inverse iteration, 84 
power iteration, 82 
QR algorithm with shift (local), 
131 
QR algorithm without shift (global), 
128 
Cramer’s rule, 141 
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Cramer, Gabriel (1704-1752), 141 


data fitting, 69 

deflation, 77 

Demmel, James (1955-—), viii 
determinant, 15, 139 
Deuflhard, Peter (1944—), viii 
diagonalizability, unitary, 77 
dot product, see inner product 
Dwyer, Paul (1901-1982), 21 


efficiency ratio, 12 
eigenpair, 75 

left, 86 
eigenvalue, 75 

degenerate, 80 

dominant, 82 
eigenvalue problem, 75 

generalized, 96 

perturbation theory, 78 
eigenvector, 75 

left, 86 

normalized, 75 
eliminatio vulgaris, 21 
end section, see algorithm 
error 

absolute, 40 

approximation, 39 

backward, 50, 60, 78 

forward, 51 

measurement, 39 

model, 39 

relative, 40 

rounding, 39 

unavoidable, 49 
error measure, 40 

choice of, 40 

componentwise, 40 
error transport, 52 
experiment, 69 


floating point numbers, 47 
floating point operation, 10 


flop, see floating point operation 
forward analysis, 51 

forward substitution, 16 
Francis, John (1934—), 88 


Gauss, Carl Friedrich (1777-1855), 21 
Gaussian elimination, see triangular 
decomposition 
Gershgorin, Semyon (1901-1933), 143 
Givens rotation, 35 
Givens, Wallace (1910-1993), 35 
glancing intersection, 42 
Goldstine, Herman (1913-2004), 21 
Golub, Gene (1932-2007), viii 
Gram, Jorgen Pedersen (1850-1916), 
32 
Gram-Schmidt 
classic (CGS), 33 
modified (MGS), 33 
group properties 
permutation matrices, 19 
triangular matrices, 15 
unipotent triangular matrices, 15 
unitary matrices, 18 
growth factor of a matrix, 64 


Halmos, Paul (1916-2006), 1 
Hamming, Richard (1915-1998), 1 
hardware arithmetic, 48 
Hessenberg reduction, 93 
Hessenberg, Karl (1904-1959), 37 
Higham, Nick (1961-), 39 
Hélder’s inequality, 119 
Horn, Roger (1942-), viii 
Householder 

method, 35, 123 

reflection, 123 
Householder, Alston (1904-1993), 35 


identity, see unit matrix 
TEEE 754, 48 

in situ, 17 

inaccuracy, see error 
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index of inertia, 95 
inner product, 4 
input-output operation, 12 
instability criterion, 53 
inverse iteration, 83 
involution, 5 
iop, see input-output operation 
iterative refinement, 66 
model, 125 
one single step, 67 
Iverson bracket, 3 


JIT compiler, 14, 105 
Johnson, Charles (1948-), viii 
Jordan normal form, 77 

Julia, viii, 8 


Kahan’s algorithm, 57 

Kahan, William ,,Velvel” (1933-), 39 
Kronecker delta, 3 

Kublanovskaya, Vera (1920-2012), 88 


LAPACK, 8 
Laplace, Pierre-Simon (1749-1827), 33 
Lauchli, Peter (1928-), 72 
least squares estimator, 70 
least squares problem, 70 
orthogonalization, 72 
Q-free solution, 73 
Legendre, Adrien-Marie (1752-1833), 
70 
LLVM (low level virtual machine), 
105 
LU factorization, see triangular de- 
composition 


machine arithmetic, 48 
machine code, 14 
machine epsilon, 47 
machine numbers, 47 
double precision, 48 
single precision, 48 
machine precision, 47 


Maple, viii 

Mathematica, viii 

MATLAB, viii, 8, 99 

matrix 
adjoint, 3 
bidiagonal, 137 
block, 6 
companion, 76 
deflated, 77 
design, 69 
diagonal, 5 
diagonally dominant, 137, 141 
e-singular, 46 
Gramian, 31 
hermitian, see self-adjoint 
memory scheme, 14 
nilpotent, 15 
non-diagonalizable, 95 
non-normal, 80 
normal, 77 
numerically singular, 46 
orthogonal, see unitary 
orthonormal columns, 18 
partitioning, 7 
permutation, 19 
positive definite, 28 
principal sub-, 29 
rank-1, 5 
s.p.d., 28 
self-adjoint, 28 
symmetric, see self-adjoint 
transpose, see adjoint 
triangular, 14 

unipotent, 15 

tridiagonal, 94, 137 
unit, 5 
unitarily similar, 87 
unitary, 18 
unitary diagonalizable, 77 
upper Hessenberg, 37 
Wilkinson, 65 
with full column rank, 31 
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memory access, 10 ill conditioned, 41 

multiple dispatch, 105 ill posed, 41 

multiplicity, 75 well conditioned, 41 
product 

NaN, 48 matrix, 7 

norm, 119 matrix-vector, 4 


absolute, 122 
equivalence, 121 
Euclidean, 4, 119 
Frobenius, 119 
Hilbert-Schmidt, 119 
max-column-sum, 121 
max-row-sum, 121 
maximum, 119 
monotone, 122 
Schur, 119 
spectral, 121 
taxi cab, 119 
normal equation, 71 
notation convention, 2 
numerical analysis, 1 
NumPy, 8 


object orientation, 105 
occupancy structure, 40 
order 

column-major, 14 

row-major, 14 
orthogonal basis, 18 
orthonormal system, 18 
outer product, 5 
overflow, 48 


partial pivoting, 26 
peak execution time, 11 
peak performance, 11 
permutation, 19 
perturbation, see error 
pipelining, 11 

pivot, 23 

power iteration, 81 


principle of direct attack, 72 


problem 


projection, 122 

orthogonal, 80, 122 
pseudoinverse, 71 
pseudospectra, 80 
Python, 8 


QR algorithm, 88 

QR decomposition, 32 
backward stability, 62 
full, 36 
normalized, 32 
reduced, 36 


Rayleigh 

iteration, 144 

quotient, 78 

shift, 129 
regression analysis, 69 
regression line, 141 


regression problem, linear, 70 


residual, 60 

rounding, 47 
relative error, 47 

Runge, Carl (1856-1927), 135 


Schmidt, Erhard (1876-1959), 33 


Schur complement, 22, 137 
Schur decomposition, 77 
Schur, Issai (1875-1941), 77 
SciPy, 8 
separation, 78 
shift, 83 
Rayleigh, 91 
Wilkinson, 92 
shift strategy, 91 


singular value decomposition, 132 


Skeel, Robert (194?—), 67 
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sparsity structure, 40 
spectral radius, 122 
spectrum, 75 
stability 
backward, 50 
numerical, 50 
of an algorithm, 50 
stability analysis 
eigenvalue as root of the charac- 
teristic polynomial, 75 
evaluation of log(1 + x), 56 
linear systems of equations, 61 
matrix product, 51 
orthogonalization method to solve 
least squares problem, 73 
quadratic equation, 54 
sample variance, 58 
stability criterion, 54 
standard model of machine numbers, 
48 
start section, see algorithm 
statistics, parametric, 69 
Stewart, Gilbert ,,Pete” (1940-), 75 
Stiefel, Eduard (1909-1978), 72 
Stigler, Stephen (1941—), 69 
Stirling’s formula, 133 
submultiplicativity, 120 
sweep operator, 137 
Sylvester equation, 144 
Sylvester, James (1814-1897), 144 
symmetric group, 19 
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Gauss—Markov, 70 
Gershgorin, 143 
Kahan, 46 
normal equation, 71 
Oettli-Prager, 61 


QR decomposition, 32, 36 
Rigal—Gaches, 60 
separation, 79 
Skeel, 67 
spectral norm estimate, 132 
triangular decomposition, 27 
Wilkinson, 63 
Thomas algorithm, 137 
Thomas, Llewellyn (1903-1992), 137 
Todd, John (1911-2007), 30 
transposition, see adjunction 
Trefethen, Lloyd Nick (1955-), 50 
triangular decomposition, 21 
backward stable, 63 
normalized, 21 
triangulation, unitary, 77 
type inference, 105 


under-determined system of equations, 
74 

underflow, 48 

unit roundoff, 47 

unit vector, 3 
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co-, 2 
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observation , 69 

parameter, 69 

row, 2 
vector processor, 11 
Viéte, Francois (1540-1603), 56 
Vieta’s formula, 56 
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